Forecasting wildfire risk in Northern California

CA Wildfire Prediction with Google Maps Video Link: https://youtu.be/7oWQHZ53C1A

0. Setup

1. Motivation

With climate change, wildfire has become more destructive to personal and property safety. There were more than one million fires that resulted in thousands of deaths and injuries in the U.S. And it is only for the year 2021. Among them, California is the most risky state. We are thinking, what if we can predict the wildfire to let people avoid the area in the first place. Well, in this model, we selected Tehama, Glenn, and Mendocino County in California as our testbed. We hope to partner with Google Maps to add a layer of wildfire prediction to the app. Basically, this app has two functions, alert, and route planning. We also hope to add the feature to the current route planning algorithm of google maps. The route planning algorithm will calculate the best route concerning the wildfire risks. For example, when users want to go to Paskenta, but there is a high probability that wildfire will hapMotivation: With climate change, wildfire has become more destructive to personal and property safety. In 2021, there were 1,353,500 fires resulting in 3,800 civilian deaths and 14,700 injuries in the United States. Among them, California is the most risky state. There were 6,994 fires that resulted in the burning of 2,569,459 acres of land, according to The Department of Forestry and Fire Protection.

Given those high numbers of loss of damages, we were thinking what if we can predict the risk level of wildfire to let people avoid the area or move/protect their properties before the fires happen. Therefore,we hope to partner with Google Maps (the map app that has the largest share of users in the map app market) to add a layer of wildfire prediction to the app. Basically, this app has two functions, alert, and route planning. For alert, when people use google maps, we want to inform them about the potential wildfire risk of their destinations by date. For route planning, we hope Google can integrate our wildfire prediction feature to their route planning algorithm.

In this model, we selected Tehama, Glenn, and Mendocino County in California as our testbed, and we gathered the data from the California Department of Forestry and Fire Protection. We used geographic data such as land cover, elevation, the nearest campground, as our location features, and time-related data such as weather and historic fire incidences as our time features. We used these as inputs to train our model and got the results of predicted wildfire location. Then, we used multiple methods to test the accuracy and generalizability of our model, and provided a cost-benefit analysis for Google Maps developers to review.

Ultimately, we would like to make a wildfire prediction layer to Google Maps to deliver wildfire risk notifications to Google Maps users. We hope our model can notify them to avoid potential wildfires at their destinations. In the longer term, we hope to test with more users, refine our model, and promote to a larger geographic area if it works well for the study area in California.

Note: As for now, because we want to predict the wildfire risk by day, our model will update and run itself everyday with the changing historic/time data. For example, if we want to predict the wildfire risk for Dec. 30th, 2021, the model will retrieve the wildfire history and weather data from Jan. 2nd to 2018 to Dec. 29, 2021. If we want to predict Dec. 30th, 2021, the model will use the wildfire history and weather data from Jan. 3rd, 2018 to Dec. 30th , 2021. The inputs of data will change every time in the model depending on the date the user wants to see the predicted result of. When developing the model with Google Maps, we hope to refine our time lag variables in the model to deliver a more time-sensitive and real-time prediction for our users. pen near the National Forest during the selected day. Although the fastest path to Paskenta will pass by the National Forest, the route planning algorithm will avoid this area and plan a new route for users’ safety.

2. Data Exploration

We pull in the geography data of three counties in Northern California and create fishnets in cellsize of 2000.

#all cali boundaries (counties)
counties <-  st_read("https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/arcgis/rest/services/California_County_Boundaries/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")  

#socal counties
norcal <- counties %>% 
  filter(COUNTY_NAME != "Kern" & COUNTY_NAME != "Riverside" & COUNTY_NAME != "Los Angeles" & COUNTY_NAME != "Imperial" & COUNTY_NAME != "San Bernardino" & COUNTY_NAME != "Ventura" & COUNTY_NAME != "Orange" & COUNTY_NAME != "Santa Barbara" & COUNTY_NAME != "San Luis Obispo" & COUNTY_NAME != "San Diego") %>%
  st_transform('ESRI:102242') 

norcal2 <- norcal %>% 
  filter(COUNTY_NAME == "Mendocino" | COUNTY_NAME == "Glenn" | COUNTY_NAME == "Tehama"  )%>%
  st_transform('ESRI:102242') 


# centroids of southern california counties
centroids_counties <- cbind(norcal2, st_coordinates(st_centroid(norcal2)))

# fishnet
fishnet <- 
  st_make_grid(norcal2,
               cellsize = 2000, 
               square = TRUE) %>%
  .[norcal2] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  dplyr::mutate(uniqueID = row_number())

2.1 Wildfire Data

As we can see from the graph, the area in the middle had a lot of fire incidents in the past 3 years.

perimeters <- st_read("https://opendata.arcgis.com/datasets/e3802d2abf8741a187e73a9db49d68fe_2.geojson")

peri1820 <- perimeters %>%
  filter(YEAR_ > 2017)%>%
  st_transform('ESRI:102242') 

norcal_only <- 
  st_intersection(st_make_valid(norcal2),
                  peri1820)
 
ggplot()+
  geom_sf(data=norcal2, fill="gray", color="white") +
  geom_sf(data=norcal_only, fill="#f06813", color="#f06813", alpha=0.9) +
     geom_text(data = centroids_counties, aes(X, Y, label = COUNTY_NAME), size = 3, fontface="bold")+
  labs(title = "Fire Perimeters in Mendocino, Glenn, and Tehama, CA",
       subtitle="2018-2021",
       caption="Data Sources: CAL FIRE; California State Geoportal") + 
  mapTheme()

Then we divided the whole area by fishnet with a size of 2000 meters. Purple means the lowest occurrence, and yellow means the highest occurrence. We also mapped the burned and unburned land as the dependent variable, and as the map shows, the result corresponds to the fire incidence map.

fire_net <- 
  dplyr::select(peri1820) %>% 
  mutate(countFire = 1) %>% 
  aggregate(., fishnet, sum) %>%
  dplyr::mutate(countFire = replace_na(countFire, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))
fire_net <-
  fire_net %>%
  dplyr::mutate(burned = case_when(countFire == 0 ~ "no", 
                                   countFire != 0 ~ "yes"))

ggplot() +
  geom_sf(data = fire_net, aes(fill = countFire), color = NA) +
  scale_fill_viridis(option="C") +
  labs(title = "Count of Fire for the fishnet") +
  mapTheme()+ theme(legend.position='right')

ggplot() +
  geom_sf(data = fire_net, aes(fill = burned), color = NA) +
  labs(title = "Burned v Unburned for the fishnet") +
  scale_fill_manual(values = palette2)+
  mapTheme()

To find out the reasons for fires, we then tested with different variables.

2.2 Raster Data: Vegetation, Elevation

We explore land cover because we think some kinds of vegetation species may result in higher risk of wildfire. Vegetation is the fuel to wildfires. From the map, we can see that evergreen forests and shrubs made the highest share of land cover among all land cover types, indicating ample fuel for the occurrence of wildfire in this area. We also conducted mode analysis to find which vegetation species made up the highest concentration of vegetation in the area by fishnet.

land_cover <- raster("D:/Teresa/508final/nlcd_2019_land_cover_l48_20210604/nlcd_2019_land_cover_l48_20210604.img")

# #reclass <- c(0,10,0,
#              10,11,11,
#              11,12,12,
#              12,20,0,
#              20,21,21,
#              21,22,22,
#              22,23,23,
#              23,24,24,
#              24,30,0,
#              30,31,31,
#              31,40,0,
#              40,41,41,
#              41,42,42,
#              42,43,43,
#              44,51,0,
#              51,52,52,
#              52,70,0,
#              70,71,71,
#              71,80,0,
#              80,81,81,
#              81,82,82,
#              82,89,0,
#              89,90,90,
#              90,94,0,
#              94,95,95,
#              95,255,0
#              )
# class_matrix <- matrix(reclass, ncol = 3, byrow= TRUE)
#land_cover2 <- reclassify(land_cover, class_matrix)



#lc2 <- crop(y = norcal2 %>% st_transform(crs(land_cover)))
fishnet1 <- cbind(fire_net, exact_extract(land_cover, fishnet %>% st_transform(crs(land_cover)), 'majority'))
fishnet1 <- fishnet1 %>%
  dplyr::rename(landcover = exact_extract.land_cover..fishnet.....st_transform.crs.land_cover....) 


elevation <- raster("D:/Teresa/508final/elevation.tif")
#crop(y = norcal2 %>% st_transform(crs(elevation)))
fishnet2 <- cbind(fishnet1, exact_extract(elevation, fishnet %>% st_transform(crs(elevation)), 'mean')) 
fishnet2 <- fishnet2 %>%
  dplyr::rename(elevation = exact_extract.elevation..fishnet.....st_transform.crs.elevation....)

fishnet2 <- fishnet2%>% 
  dplyr::mutate(land_cover = case_when(landcover == 11 ~ "water",
                                       landcover == 12 ~ "perennial snow", 
                                       landcover == 21 ~ "developed, open space", 
                                       landcover == 22 ~ "developed, low intensity", 
                                       landcover == 23 ~ "developed, median intensity", 
                                       landcover == 24 ~ "developed, high intensity", 
                                       landcover == 31 ~ "barren", 
                                       landcover == 41 ~ "deciduous foreest", 
                                       landcover == 42 ~ "evergreen forest", 
                                       landcover == 43 ~ "mixed forest", 
                                       landcover == 52 ~ "shrub", 
                                       landcover == 71 ~ "herbaceous", 
                                       landcover == 81 ~ "hay", 
                                       landcover == 82 ~ "cultivated crops", 
                                       landcover == 90 ~ "woody wetlands", 
                                       landcover == 95 ~ "emerging herbaceous wetlands")) %>% 
  dplyr::select(-landcover)

ggplot() +
  geom_sf(data = fishnet2, aes(fill = land_cover), color = NA) +
  labs(title = "Land cover for the fishnet") +
  mapTheme()

We also consider the mean elevation as one of our features because different elevation slopes may cause different exposure levels of wind speed and humility. We gathered the data from the Multi-Resolution Land Characteristics (MRLC) consortium. Different exposure levels of these may cause different levels of wildfire risk. We found that the areas with highest elevation (2000 meters and up) are concentrated in the middle part of the study area, which corresponds to the result of the historic fire incidence map.

ggplot() +
  geom_sf(data = fishnet2, aes(fill = elevation), color = NA) +
  labs(title = "Mean Elevation for the fishnet") +
  mapTheme()

2.3 Weather Data

For weather, we gathered the data from ASOS (Automated Surface Observing Systems) weather stations from 2018 to 2021 for Tehama, Glenn, and Mendocino County in California. We found that average temperature is higher and wind speed is faster in the northeastern part of the study area, but the average humidity is higher in Southwestern part. It brings a higher wildfire risk for the northeastern part of the area, because the dryness, faster wind speed, and high temperature are some of the most direct causes of wildfire ignition and spread.

weather <- read.csv("D:/Teresa/508final/weather.csv") 

weather_annual <- weather %>% st_as_sf(coords = c("lon", "lat"), crs = 4236, agr="constant") %>% st_transform(st_crs("ESRI:102242"))

library(dplyr)
weather_annual <- weather_annual %>% dplyr::group_by(year, id) %>%
      dplyr::summarize(mean_ann_tmpf = mean(mean_tmpf, na.rm = TRUE),
                       mean_ann_precipitation = mean(mean_precipitation, na.rm = TRUE),
                       mean_ann_humidity = mean(mean_humidity, na.rm = TRUE),
                       mean_ann_windspeed = mean(mean_wind_Speed, na.rm = TRUE))

#create new df with only precipitation variable
weather_precip <- weather_annual %>% dplyr::select(geometry, mean_ann_precipitation)
weather_precip.idw <- gstat::idw(mean_ann_precipitation ~ 1, weather_precip, newdata=fishnet2, idp=2.0)
weather_precip.idw <- weather_precip.idw %>% dplyr::rename(avg_precip_idw = var1.pred)
precip.idw <- weather_precip.idw %>% dplyr::select(avg_precip_idw, geometry) 
# join to fishnet
fishnet2 <- st_join(fishnet2, precip.idw, join=st_equals)
 
library(dplyr)
#new df with only wind speed variable
weather_wind <- weather_annual %>% dplyr::select(geometry, mean_ann_windspeed)
weather_wind.idw <- gstat::idw(mean_ann_windspeed ~ 1, weather_wind, newdata=fishnet2, idp=2.0)
weather_wind.idw <- weather_wind.idw %>% dplyr::rename(avg_windspeed_idw = var1.pred)
avg_windspeed.idw <- weather_wind.idw %>% dplyr::select(avg_windspeed_idw, geometry) 
# join to fishnet
fishnet2 <- st_join(fishnet2, avg_windspeed.idw, join=st_equals)

weather_temp <- weather_annual %>% dplyr::select(geometry, mean_ann_tmpf)
weather_temp.idw <- gstat::idw(mean_ann_tmpf ~ 1, weather_temp, newdata=fishnet2, idp=2.0)
weather_temp.idw <- weather_temp.idw %>% dplyr::rename(avg_temp_idw = var1.pred)
avg_temp.idw <- weather_temp.idw %>% dplyr::select(avg_temp_idw, geometry) 
# join to fishnet
fishnet2 <- st_join(fishnet2, avg_temp.idw, join=st_equals)

library(dplyr)
weather_hum <- weather_annual %>% dplyr::select(geometry, mean_ann_humidity)
weather_hum.idw <- gstat::idw(mean_ann_humidity ~ 1, weather_hum, newdata=fishnet2, idp=2.0)
weather_hum.idw <- weather_hum.idw %>% dplyr::rename(avg_hum_idw = var1.pred)
avg_hum.idw <- weather_hum.idw %>% dplyr::select(avg_hum_idw, geometry) 
# join to fishnet
fishnet2 <- st_join(fishnet2, avg_hum.idw, join=st_equals)

precip <- ggplot()+
  geom_sf(data=weather_precip.idw, aes(fill=avg_precip_idw)) +
  scale_fill_viridis(option="C") +
  labs(title="Average Precipitation",
       subtitle="2018-2020",
       caption="Data source: NOAA",
       fill="Inches")+
          mapTheme() + theme(legend.position='right')

wind <- ggplot()+
  geom_sf(data=weather_wind.idw, aes(fill=avg_windspeed_idw)) +
  scale_fill_viridis(option="C") +
  labs(title="Average Wind Speed",
       subtitle="2018-2020",
       caption="Data source: NOAA",
       fill="Tenths m/sec")+
          mapTheme() + theme(legend.position='right')

temp <- ggplot()+
  geom_sf(data=weather_temp.idw, aes(fill=avg_temp_idw)) +
  scale_fill_viridis(option="C") +
  labs(title="Average Temperature",
       subtitle="2018-2020",
       caption="Data source: NOAA",
       fill="Degrees (F)")+
          mapTheme()+ theme(legend.position='right')

humidity <- ggplot()+
  geom_sf(data=weather_hum.idw, aes(fill=avg_hum_idw)) +
  scale_fill_viridis(option="C") +
  labs(title="Average Relative Humidity",
       subtitle="2018-2020",
       caption="Data source: NOAA",
       fill="Percentage")+
          mapTheme()+ theme(legend.position='right')

grid.arrange(ncol=2, precip, wind, temp, humidity)

#burned_num
fishnet2 <-
  fishnet2 %>%
  dplyr::mutate(burned_num = ifelse(countFire == 0, 0, 1))

2.4 Other Feature Engineering

For Tree mortality, more dead trees mean more fuel for fires to burn. We gathered the data from the California Open Data Portal and the result shows that those dead trees are concentrated in the middle and eastern part of the study area.

tier1hazard_treemortality <- st_read("https://opendata.arcgis.com/datasets/a71a85136b0b414ea734fdfbe3d7674a_0.geojson") %>% st_transform(st_crs(fishnet2))
fishnet2 <- fishnet2 %>% mutate(n_tier1hazard_int = lengths(st_intersects(fishnet2, tier1hazard_treemortality)))

ggplot()+
  geom_sf(data=fishnet2, aes(fill=n_tier1hazard_int)) +
  scale_fill_viridis(option="T") +
  labs(title="Tree Mortality Hazard Classification",
       fill="Amount",
  subtitle="Per Fishnet Cell",
  caption="Data source: California Open Data Portal")+
          mapTheme()

We also computed the distance to the nearest campground by using the knn function, because many of the wildfire cases were caused by campfires. Campgrounds are more likely to have campfires. We gathered the data from California Open Data Portal and we found that the campground spread across the study area.

#read in camp grounds
camp_ground <- st_read("D:/Teresa/508final/campground/Campgrounds.shp") %>% 
    dplyr::select(geometry) %>%
    st_transform(st_crs(fishnet2))

# calculate distance to nearest camp ground
fishnet2<-
  fishnet2 %>%
    mutate(dist_camp_ground=st_nn(st_centroid(fishnet2), camp_ground, 1))

fishnet2$dist_camp_ground <- as.numeric(fishnet2$dist_camp_ground)

ggplot()+
  geom_sf(data=fishnet2, aes(fill=dist_camp_ground)) +
  scale_fill_viridis(option="B") +
  labs(title="Distance to Nearest Camp Grounds",
       fill="Distance \n(meters)",
  subtitle="Per Fishnet Cell",
  caption="Data source: California Open Data Portal")+
          mapTheme()

We also explored the nearest electric transmission lines, electrical substation, and nearest power plants. We gathered the data from the California Open Data Portal. The high voltage electricity in those lines, plants, and stations may cause wildfire if it lacks maintenance. There will also be potential flammable liquid leaks and malfunctions at those stations and plants. We should also pay special attention to the high voltage electric transmission lines near the forests. As the maps show, the darker the cells, the closer the distance. And we can see it is concentrated in the Northeastern part of the study area, making it more vulnerable to the risk of wildfire.

#read in electric transmission
electric_transmission <- st_read("D:/Teresa/508final/California_Electric_Transmission_Lines.geojson") %>% 
    dplyr::select(geometry) %>%
    na.omit() %>% 
    st_transform(st_crs(fishnet2))


# calculate distance to nearest electric transmission 
fishnet2<-
  fishnet2 %>%
    mutate(dist_elec_tran=st_nn(st_centroid(fishnet2), electric_transmission, 1))

fishnet2$dist_elec_tran <- as.numeric(fishnet2$dist_elec_tran)

ggplot()+
  geom_sf(data=fishnet2, aes(fill=dist_elec_tran)) +
  scale_fill_viridis(option="E") +
  labs(title="Distance to Nearest Electirc Transmission",
       fill="Distance \n(meters)",
  subtitle="Per Fishnet Cell",
  caption="Data source: California Open Data Portal")+
          mapTheme()

#read in electric substations
electric_substations <- st_read("https://opendata.arcgis.com/datasets/7f37f2535d3144e898a53b9385737ee0_0.geojson") %>% 
  dplyr::select(Y = Lat, X = Lon) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet2))

# calculate distance to nearest electric substation 
fishnet2<-
  fishnet2 %>%
    mutate(dist_elec_substation=st_nn(st_centroid(fishnet2), electric_substations, 1))

fishnet2$dist_elec_substation <- as.numeric(fishnet2$dist_elec_substation)

ggplot()+
  geom_sf(data=fishnet2, aes(fill=dist_elec_substation)) +
  scale_fill_viridis(option="E") +
  labs(title="Distance to Nearest Electrical Substation",
       fill="Distance \n(meters)",
  subtitle="Per Fishnet Cell",
  caption="Data source: California Open Data Portal")+
          mapTheme()

fire_facilities <- st_read("https://gis.data.cnra.ca.gov/datasets/CALFIRE-Forestry::cal-fire-facilities-for-wildland-fire-protection.geojson?outSR=%7B%22latestWkid%22%3A3857%2C%22wkid%22%3A102100%7D") %>% 
  dplyr::select(Y = LAT, X = LON) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_zm()%>%
    st_transform(st_crs(fishnet2))


# calculate distance to nearest facility

fishnet2<-
  fishnet2 %>%
    mutate(dist_fire_facilities=st_nn(st_centroid(fishnet2), fire_facilities, 1))

fishnet2$dist_fire_facilities <- as.numeric(fishnet2$dist_fire_facilities)

power_plants <- st_read("https://services3.arcgis.com/bWPjFyq029ChCGur/arcgis/rest/services/Power_Plant/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson") %>% 
    dplyr::select(geometry) %>%
    na.omit() %>% 
    st_transform(st_crs(fishnet2))



# calculate distance to nearest power plants

fishnet2<-
  fishnet2 %>%
    mutate(dist_power_plants=st_nn(st_centroid(fishnet2), power_plants, 1))

fishnet2$dist_power_plants <- as.numeric(fishnet2$dist_power_plants)

ggplot()+
  geom_sf(data=fishnet2, aes(fill=dist_power_plants)) +
  scale_fill_viridis(option="E") +
  labs(title="Distance to Nearest Power Plants",
       fill="Distance \n(meters)",
  subtitle="Per Fishnet Cell",
  caption="Data source: California Open Data Portal")+
          mapTheme()

We also calculate the distance from each cell’s centroid to the nearest fire facility. The data is from the California Open Data Portal. We can see that there is a disproportionate distribution of fire facilities in that the northern part of the study area is closer to the contracted fire facilities than the southern part of the study area. It does not correlate much with the historic fire incidence map.

ggplot()+
  geom_sf(data=fishnet2, aes(fill=dist_fire_facilities)) +
  scale_fill_viridis(option="E") +
  labs(title="Distance to Nearest Fire Facilities",
       fill="Distance \n(meters)",
  subtitle="Per Fishnet Cell",
  caption="Data source: California Open Data Portal")+
          mapTheme()

#write.csv(fishnet2, "D:/Teresa/508final/fishnet2.csv")

3. Model

In this section, we use a logistic regression model which can predict a binary outcome, along with the variables and the new features. It is a well-suited model for our case of predicting the probability that a fire will occur. First, the data is divided into 75% of training datasets and 25% of testing datasets, and the model runs on the training set by using createDataPartition. Then, the regression model used all the variables including the engineered features that help improve the model. As a result, the model has the lowest AIC, which shows a better fit.

set.seed(3456)
trainIndex <- createDataPartition(fishnet2$burned, 
                                  p = .75,
                                  list = FALSE,
                                  times = 1)
fireTrain <- fishnet2[ trainIndex,]
fireTest  <- fishnet2[-trainIndex,]

reg1<- glm(burned_num ~ .,
                  data=st_drop_geometry(fireTrain) %>% dplyr::select(-uniqueID, -burned, -countFire, -cvID, -land_cover, -dist_elec_tran), family="binomial")
summary(reg1)

## 
## Call:
## glm(formula = burned_num ~ ., family = "binomial", data = st_drop_geometry(fireTrain) %>% 
##     dplyr::select(-uniqueID, -burned, -countFire, -cvID, -land_cover, 
##         -dist_elec_tran))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1772  -0.5114  -0.3179  -0.1380   2.9709  
## 
## Coefficients: (3 not defined because of singularities)
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.457e+01  5.045e+00   4.871 1.11e-06 ***
## elevation             3.635e-03  1.359e-04  26.759  < 2e-16 ***
## avg_precip_idw       -6.639e+02  1.204e+02  -5.515 3.49e-08 ***
## avg_windspeed_idw            NA         NA      NA       NA    
## avg_temp_idw                 NA         NA      NA       NA    
## avg_hum_idw                  NA         NA      NA       NA    
## n_tier1hazard_int    -9.250e-02  2.006e-02  -4.611 4.01e-06 ***
## dist_camp_ground      1.620e-03  5.473e-04   2.960  0.00307 ** 
## dist_elec_substation -8.112e-04  5.557e-05 -14.599  < 2e-16 ***
## dist_fire_facilities  1.786e-03  1.665e-04  10.728  < 2e-16 ***
## dist_power_plants    -1.218e-02  2.268e-03  -5.371 7.82e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4204.5  on 4003  degrees of freedom
## Residual deviance: 2583.2  on 3996  degrees of freedom
## AIC: 2599.2
## 
## Number of Fisher Scoring iterations: 6

In this section, to ensure the generalizability of the model, which needs to have a great performance on new data and predict accurately, we analyze the goodness of fits. First, using fit metrics, we can observe the McFadden R-Squared, which suggests that the higher, the better. The following calculation shows that the regression model has the best fit. The results of the model for the training data are included below.

pR2(reg1)

## fitting null model for pseudo-r2

##           llh       llhNull            G2      McFadden          r2ML 
## -1291.5796086 -2102.2721588  1621.3851004     0.3856268     0.3329841 
##          r2CU 
##     0.5122104

3.1 Goodness of Fit

testProbs <- data.frame(Outcome = as.factor(fireTest$burned_num),
                        Probs = predict(reg1, fireTest, type= "response"))

testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))

3.1.1 Confusion Matrix and Statistics

In this part, I use confusion metrics to better see how the prediction performs. The confusion matrix and its statistics for the regression are shown below. The matrix shows the comparison between observed and predicted burned and unburned area with the fire occurrence is 50% threshold. The accuracy is 88%. Sensitivity, which is the ability to predict true positive outcomes, is 58%. Specificity, which is the ability to predict true negative outcomes, is 96%. This shows that our model can better predict absence of fire and actually no fire occurs. But still, prediction of fire truly occurring is still a moderate prediction in our model.

xtab.reg1 <-caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")

as.matrix(xtab.reg1) %>% kable(caption = "Confusion Matrix") %>% kable_styling("striped", full_width = T, font_size = 14, position = "left")

Confusion Matrix
	0	1
0	1004	122
1	39	169

caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1004  122
##          1   39  169
##                                           
##                Accuracy : 0.8793          
##                  95% CI : (0.8606, 0.8963)
##     No Information Rate : 0.7819          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6056          
##                                           
##  Mcnemar's Test P-Value : 1.03e-10        
##                                           
##             Sensitivity : 0.5808          
##             Specificity : 0.9626          
##          Pos Pred Value : 0.8125          
##          Neg Pred Value : 0.8917          
##              Prevalence : 0.2181          
##          Detection Rate : 0.1267          
##    Detection Prevalence : 0.1559          
##       Balanced Accuracy : 0.7717          
##                                           
##        'Positive' Class : 1               
##

as.matrix(xtab.reg1, what="classes") %>% kable(caption = "Confusion Matrix - Statistics") %>% kable_styling(font_size = 14, full_width = T,
                bootstrap_options = c("striped", "hover"))

Confusion Matrix - Statistics
Sensitivity	0.5807560
Specificity	0.9626079
Pos Pred Value	0.8125000
Neg Pred Value	0.8916519
Precision	0.8125000
Recall	0.5807560
F1	0.6773547
Prevalence	0.2181409
Detection Rate	0.1266867
Detection Prevalence	0.1559220
Balanced Accuracy	0.7716819

3.1.2. Distribution of Predicted Probabilities

Below shows a distribution of outcome of the predicted probabilities in our model. The model can better predict the negatives(no fire occurs) than the positives(fire occurs).

ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) + xlim(0, 1) +
  labs(x = "Fire Occurred or Not", y = "Density of Probabilities",
       title = "Distribution of Predicted Probabilities") +
  plotTheme() + theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

3.1.3. ROC Curve

By plotting the ROC curve, we can see that if it includes a larger area under the curve, it is a better model. As I plotted below, the model includes an area under the curve by 0.8773, which indicates itself a fairly useful fit. Note that the points on the ROC curve means the model will predict true positive while predicting false positive correctly under the threshold.

ggplot(testProbs, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve") + plotTheme()

auc(testProbs$Outcome, testProbs$Probs)

## Area under the curve: 0.8773

3.2 Cross Validation

Using cross validation to repeat the process to see the performance of the regression models can ensure that the model generalizes well across contexts. A k-fold cross validation is performed in this section. The model has a ROC of 89% and we can still observe that the model can better predict the true negative.

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

# str(fishnet2$burned)
# fishnet2$burned <- as.factor(fishnet2$burned) 
# fishnet2$burned <- fct_rev(fishnet2$burned)

cvFit2 <- train(burned ~ .,
                  data=st_drop_geometry(fishnet2) %>% 
                    dplyr::select(-uniqueID, -burned_num, -countFire, -dist_elec_tran), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit2

3.2.1 Cross Validation Results

Using cross validation shows that our model perform better when predicting negatives thant predicting positive. It is quite fair results since our use case can also benefit from predincting true negative and shows that there will be no fire for users to pass through the area.

print(cvFit2) %>% kable(caption="Cross-Validation Results") %>% kable_styling(font_size = 14, full_width = T,bootstrap_options = c("striped", "hover"))

## Generalized Linear Model 
## 
## 5338 samples
##   12 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 5285, 5284, 5284, 5284, 5285, 5284, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8900086  0.9588444  0.5990909

Cross-Validation Results
	ROC	Sens	Spec
	0.8900086	0.9588444	0.5990909

The faceted plots below shows cross-validation goodness of fit metrics, which include the mean for area under the curve (“ROC”), sensitivity (“Sens”), and specificity (“Spec”) across all of the 100 folds.(the data in of Sens and Spec are reversed, further fix needed)

We look to the distributions for each of the goodness of fit metrics to evaluate the generalizability of the model. The model has a great generalizability in specificity since the tighter distributions around means indicate great generalizability. The prediction of no fire occurring is well predicted. However, the prediction of fire occurrene has a fair generalizability.

dplyr::select(cvFit2$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit2$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) +
    geom_histogram(bins=35, fill = "#f7b61f") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "red", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="Cross-Validation Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

3.3 Costs and Benefits

Our model will not have perfect accuracy, and when determining the optimal threshold it’s important to consider the costs and benefits of inaccurately predicting data. Keep in mind that the ultimate use case for this data is to help inform where the fire occurs and pass through the potentially harmful area to avoid fire. Let’s make some approximations and assume the following: * Google maps earn 4.3 billion every year, and there are 143 million users every month. So, we assume that every user contributes $2.32 in revenue from Google maps. * Predictions are being made at the cell level * User will quit using the app when a prediction is wrong

The calculation equations are provided below: * True Positive: 2.32 * Count * True Negative: 2.32 * Count * False Positive: 0 * False Negative: ( -2.32) * Count

 # cost_benefit_table <-
 #    testProbs %>%
 #     mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>% 
 #       count(predOutcome, Outcome) %>%
 #       summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
 #                 True_Positive = sum(n[predOutcome==1 & Outcome==1]),
 #                 False_Negative = sum(n[predOutcome==0 & Outcome==1]),
 #                 False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
 #        gather(Variable, Count) %>%
 #        mutate(Revenue =
 #                ifelse(Variable == "True_Negative", Count * 2.32,
 #                ifelse(Variable == "True_Positive",Count * 2.32,
 #                ifelse(Variable == "False_Negative", Count * 0,
 #                ifelse(Variable == "False_Positive", Count * (-2.32), 0))))) %>%
 #     bind_cols(data.frame(Description = c(
 #               "Predicted correctly fire will not occur and it does not happen. Users keep using the app.",
 #               "Predicted correctly fire will occur and it happens. Users keep using the app.",
 #               "Predicted incorrectly that fire will occur but the fire does not occur. Users might not notice and won’t change the user behavior.",
 #               "Predicted incorrectly that fire will not occur but the fire occurs. Users quit using the app.")))
 # 
 # kable(cost_benefit_table,
 #        caption = "Cost/Benefit Table") %>% kable_styling()

3.3.1 Optimal Threshold

It is important to identify the threshold that provides the greatest accuracy to best optimize our model for the use. We add up the counts of each prediction and calculate the percentage to define our threshold. The optimal threshold in this case is 0.41, which has the highest accuracy among all.

iterateThresholds <- function(data, group) {
   group <- enquo(group)
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      testProbs %>%
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
         group_by(!!group) %>%
      dplyr::count(predOutcome, Outcome) %>%
      dplyr::summarize(TN_sum = sum(n[predOutcome==0 & Outcome==0]),
                        TP_sum = sum(n[predOutcome==1 & Outcome==1]),
                        FN_sum = sum(n[predOutcome==0 & Outcome==1]),
                        FP_sum = sum(n[predOutcome==1 & Outcome==0]),
            total=sum(n)) %>%
    mutate(True_Positive = TP_sum / total,
         True_Negative = TN_sum / total,
         False_Negative = FN_sum / total,
         False_Positive = FP_sum / total,
         Accuracy = (TP_sum + TN_sum) / total, Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }
return(all_prediction)
}

whichThreshold <- iterateThresholds(testProbs)

allThresholds<-kable(whichThreshold) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)%>%
  scroll_box(width = "100%", height = "500px")

allThresholds

sum_TN	sum_TP	sum_FN	sum_FP	total	True_Positive	True_Negative	False_Negative	False_Positive	Accuracy	Threshold
54	290	1	989	1334	0.2173913	0.0404798	0.0007496	0.7413793	0.2578711	0.01
133	286	5	910	1334	0.2143928	0.0997001	0.0037481	0.6821589	0.3140930	0.02
269	282	9	774	1334	0.2113943	0.2016492	0.0067466	0.5802099	0.4130435	0.03
339	281	10	704	1334	0.2106447	0.2541229	0.0074963	0.5277361	0.4647676	0.04
396	276	15	647	1334	0.2068966	0.2968516	0.0112444	0.4850075	0.5037481	0.05
443	273	18	600	1334	0.2046477	0.3320840	0.0134933	0.4497751	0.5367316	0.06
513	270	21	530	1334	0.2023988	0.3845577	0.0157421	0.3973013	0.5869565	0.07
567	265	26	476	1334	0.1986507	0.4250375	0.0194903	0.3568216	0.6236882	0.08
615	260	31	428	1334	0.1949025	0.4610195	0.0232384	0.3208396	0.6559220	0.09
657	256	35	386	1334	0.1919040	0.4925037	0.0262369	0.2893553	0.6844078	0.10
691	249	42	352	1334	0.1866567	0.5179910	0.0314843	0.2638681	0.7046477	0.11
723	246	45	320	1334	0.1844078	0.5419790	0.0337331	0.2398801	0.7263868	0.12
743	243	48	300	1334	0.1821589	0.5569715	0.0359820	0.2248876	0.7391304	0.13
758	242	49	285	1334	0.1814093	0.5682159	0.0367316	0.2136432	0.7496252	0.14
776	238	53	267	1334	0.1784108	0.5817091	0.0397301	0.2001499	0.7601199	0.15
793	238	53	250	1334	0.1784108	0.5944528	0.0397301	0.1874063	0.7728636	0.16
815	236	55	228	1334	0.1769115	0.6109445	0.0412294	0.1709145	0.7878561	0.17
833	235	56	210	1334	0.1761619	0.6244378	0.0419790	0.1574213	0.8005997	0.18
858	235	56	185	1334	0.1761619	0.6431784	0.0419790	0.1386807	0.8193403	0.19
866	232	59	177	1334	0.1739130	0.6491754	0.0442279	0.1326837	0.8230885	0.20
876	230	61	167	1334	0.1724138	0.6566717	0.0457271	0.1251874	0.8290855	0.21
893	228	63	150	1334	0.1709145	0.6694153	0.0472264	0.1124438	0.8403298	0.22
907	228	63	136	1334	0.1709145	0.6799100	0.0472264	0.1019490	0.8508246	0.23
914	223	68	129	1334	0.1671664	0.6851574	0.0509745	0.0967016	0.8523238	0.24
920	221	70	123	1334	0.1656672	0.6896552	0.0524738	0.0922039	0.8553223	0.25
926	221	70	117	1334	0.1656672	0.6941529	0.0524738	0.0877061	0.8598201	0.26
936	219	72	107	1334	0.1641679	0.7016492	0.0539730	0.0802099	0.8658171	0.27
942	217	74	101	1334	0.1626687	0.7061469	0.0554723	0.0757121	0.8688156	0.28
949	212	79	94	1334	0.1589205	0.7113943	0.0592204	0.0704648	0.8703148	0.29
961	208	83	82	1334	0.1559220	0.7203898	0.0622189	0.0614693	0.8763118	0.30
965	202	89	78	1334	0.1514243	0.7233883	0.0667166	0.0584708	0.8748126	0.31
967	200	91	76	1334	0.1499250	0.7248876	0.0682159	0.0569715	0.8748126	0.32
971	199	92	72	1334	0.1491754	0.7278861	0.0689655	0.0539730	0.8770615	0.33
972	196	95	71	1334	0.1469265	0.7286357	0.0712144	0.0532234	0.8755622	0.34
977	193	98	66	1334	0.1446777	0.7323838	0.0734633	0.0494753	0.8770615	0.35
979	192	99	64	1334	0.1439280	0.7338831	0.0742129	0.0479760	0.8778111	0.36
981	189	102	62	1334	0.1416792	0.7353823	0.0764618	0.0464768	0.8770615	0.37
985	187	104	58	1334	0.1401799	0.7383808	0.0779610	0.0434783	0.8785607	0.38
987	186	105	56	1334	0.1394303	0.7398801	0.0787106	0.0419790	0.8793103	0.39
989	184	107	54	1334	0.1379310	0.7413793	0.0802099	0.0404798	0.8793103	0.40
991	183	108	52	1334	0.1371814	0.7428786	0.0809595	0.0389805	0.8800600	0.41
992	181	110	51	1334	0.1356822	0.7436282	0.0824588	0.0382309	0.8793103	0.42
993	179	112	50	1334	0.1341829	0.7443778	0.0839580	0.0374813	0.8785607	0.43
994	178	113	49	1334	0.1334333	0.7451274	0.0847076	0.0367316	0.8785607	0.44
995	177	114	48	1334	0.1326837	0.7458771	0.0854573	0.0359820	0.8785607	0.45
996	176	115	47	1334	0.1319340	0.7466267	0.0862069	0.0352324	0.8785607	0.46
996	174	117	47	1334	0.1304348	0.7466267	0.0877061	0.0352324	0.8770615	0.47
1000	171	120	43	1334	0.1281859	0.7496252	0.0899550	0.0322339	0.8778111	0.48
1001	169	122	42	1334	0.1266867	0.7503748	0.0914543	0.0314843	0.8770615	0.49
1004	169	122	39	1334	0.1266867	0.7526237	0.0914543	0.0292354	0.8793103	0.50
1004	165	126	39	1334	0.1236882	0.7526237	0.0944528	0.0292354	0.8763118	0.51
1006	164	127	37	1334	0.1229385	0.7541229	0.0952024	0.0277361	0.8770615	0.52
1006	162	129	37	1334	0.1214393	0.7541229	0.0967016	0.0277361	0.8755622	0.53
1006	159	132	37	1334	0.1191904	0.7541229	0.0989505	0.0277361	0.8733133	0.54
1007	157	134	36	1334	0.1176912	0.7548726	0.1004498	0.0269865	0.8725637	0.55
1008	154	137	35	1334	0.1154423	0.7556222	0.1026987	0.0262369	0.8710645	0.56
1008	151	140	35	1334	0.1131934	0.7556222	0.1049475	0.0262369	0.8688156	0.57
1009	149	142	34	1334	0.1116942	0.7563718	0.1064468	0.0254873	0.8680660	0.58
1009	147	144	34	1334	0.1101949	0.7563718	0.1079460	0.0254873	0.8665667	0.59
1009	146	145	34	1334	0.1094453	0.7563718	0.1086957	0.0254873	0.8658171	0.60
1010	143	148	33	1334	0.1071964	0.7571214	0.1109445	0.0247376	0.8643178	0.61
1012	139	152	31	1334	0.1041979	0.7586207	0.1139430	0.0232384	0.8628186	0.62
1013	137	154	30	1334	0.1026987	0.7593703	0.1154423	0.0224888	0.8620690	0.63
1014	135	156	29	1334	0.1011994	0.7601199	0.1169415	0.0217391	0.8613193	0.64
1016	132	159	27	1334	0.0989505	0.7616192	0.1191904	0.0202399	0.8605697	0.65
1016	130	161	27	1334	0.0974513	0.7616192	0.1206897	0.0202399	0.8590705	0.66
1016	129	162	27	1334	0.0967016	0.7616192	0.1214393	0.0202399	0.8583208	0.67
1017	126	165	26	1334	0.0944528	0.7623688	0.1236882	0.0194903	0.8568216	0.68
1017	125	166	26	1334	0.0937031	0.7623688	0.1244378	0.0194903	0.8560720	0.69
1019	121	170	24	1334	0.0907046	0.7638681	0.1274363	0.0179910	0.8545727	0.70
1020	120	171	23	1334	0.0899550	0.7646177	0.1281859	0.0172414	0.8545727	0.71
1021	118	173	22	1334	0.0884558	0.7653673	0.1296852	0.0164918	0.8538231	0.72
1022	114	177	21	1334	0.0854573	0.7661169	0.1326837	0.0157421	0.8515742	0.73
1022	112	179	21	1334	0.0839580	0.7661169	0.1341829	0.0157421	0.8500750	0.74
1023	107	184	20	1334	0.0802099	0.7668666	0.1379310	0.0149925	0.8470765	0.75
1023	104	187	20	1334	0.0779610	0.7668666	0.1401799	0.0149925	0.8448276	0.76
1024	97	194	19	1334	0.0727136	0.7676162	0.1454273	0.0142429	0.8403298	0.77
1025	92	199	18	1334	0.0689655	0.7683658	0.1491754	0.0134933	0.8373313	0.78
1025	90	201	18	1334	0.0674663	0.7683658	0.1506747	0.0134933	0.8358321	0.79
1026	87	204	17	1334	0.0652174	0.7691154	0.1529235	0.0127436	0.8343328	0.80
1026	85	206	17	1334	0.0637181	0.7691154	0.1544228	0.0127436	0.8328336	0.81
1030	82	209	13	1334	0.0614693	0.7721139	0.1566717	0.0097451	0.8335832	0.82
1034	80	211	9	1334	0.0599700	0.7751124	0.1581709	0.0067466	0.8350825	0.83
1034	71	220	9	1334	0.0532234	0.7751124	0.1649175	0.0067466	0.8283358	0.84
1034	70	221	9	1334	0.0524738	0.7751124	0.1656672	0.0067466	0.8275862	0.85
1035	66	225	8	1334	0.0494753	0.7758621	0.1686657	0.0059970	0.8253373	0.86
1036	60	231	7	1334	0.0449775	0.7766117	0.1731634	0.0052474	0.8215892	0.87
1036	57	234	7	1334	0.0427286	0.7766117	0.1754123	0.0052474	0.8193403	0.88
1036	50	241	7	1334	0.0374813	0.7766117	0.1806597	0.0052474	0.8140930	0.89
1036	46	245	7	1334	0.0344828	0.7766117	0.1836582	0.0052474	0.8110945	0.90
1037	39	252	6	1334	0.0292354	0.7773613	0.1889055	0.0044978	0.8065967	0.91
1037	33	258	6	1334	0.0247376	0.7773613	0.1934033	0.0044978	0.8020990	0.92
1037	29	262	6	1334	0.0217391	0.7773613	0.1964018	0.0044978	0.7991004	0.93
1038	28	263	5	1334	0.0209895	0.7781109	0.1971514	0.0037481	0.7991004	0.94
1039	23	268	4	1334	0.0172414	0.7788606	0.2008996	0.0029985	0.7961019	0.95
1041	16	275	2	1334	0.0119940	0.7803598	0.2061469	0.0014993	0.7923538	0.96
1042	14	277	1	1334	0.0104948	0.7811094	0.2076462	0.0007496	0.7916042	0.97
1043	9	282	0	1334	0.0067466	0.7818591	0.2113943	0.0000000	0.7886057	0.98
1043	2	289	0	1334	0.0014993	0.7818591	0.2166417	0.0000000	0.7833583	0.99

whichThreshold <- iterateThresholds(testProbs)

whichThreshold %>%
  ggplot(.,aes(Threshold, Accuracy)) +
  geom_point() +
  scale_colour_manual(values = paletteInferno5) +    
  labs(title = "Confusion Metric Outcome by Threshold",
       y = "Count") +
  guides(colour=guide_legend(title = "Legend")) +
  plotTheme()

4. Results

We join back the predicted probabilities to the fishnet by the unique ID of each cell. We also use weighting to select the grid cells value when overlapped in joining into census tracts.

predict_fishnet <- 
  data.frame(uniqueID = fishnet2$uniqueID,
  Outcome = fishnet2$burned,
  Probs = predict(reg1, fishnet2, type="response"))

predict_fishnet <- 
  predict_fishnet %>%
  mutate(predOutcome  = as.factor(ifelse(predict_fishnet$Probs > 0.41, 1, 0)))


fishnet_final_preds <- cbind(fishnet2,predict_fishnet)

ggplot() +
  geom_sf(data = fishnet_final_preds, aes(fill = Outcome), color = NA) + 
  scale_fill_manual(values = palette2)+
  geom_sf(data=norcal_only, fill= NA, color="#f06813", alpha=0.9) +
     geom_text(data = centroids_counties, aes(X, Y, label = COUNTY_NAME), size = 3, fontface="bold")+
  labs(title = "Prediction by fishnet cells") +
  mapTheme()

census_api_key("16325a5baedf1c986a182c3321f90022be22845a", overwrite = TRUE)

tracts20 <-  
  get_acs(geography = "tract",
          variables = c("B25026_001E"),
          year=2020, state=06, geometry=TRUE) %>% 
  st_transform('ESRI:102242')

fishnet_final_preds_only <- fishnet_final_preds %>% dplyr::select(uniqueID, Probs, geometry)

interpolate_probs <-
  dplyr::select(fishnet_final_preds, Probs) %>%
  st_interpolate_aw(., tracts20, extensive = FALSE) %>% st_sf()


interpolation_tracts <- st_join(interpolate_probs, tracts20, join=st_equals)

fireProb_tracts <- interpolation_tracts %>% dplyr::select(GEOID, geometry, Probs) %>% st_sf()

fireProb_tracts_nogeom <- interpolation_tracts %>% dplyr::select(GEOID, Probs) %>% st_drop_geometry()
final_norcal_only <- 
  st_intersection(st_make_valid(norcal2),
                  fireProb_tracts)
#6. write.csv

ggplot() +
  geom_sf(data = final_norcal_only, aes(fill = Probs), color = NA) +
  labs(title = "Prediction by Census Tracts") +
  mapTheme()

5. Discussion

We are excited to use our model to better predict the wildfire occurrence in northern California. We ran our model and as you can see from the results maps, one by fishnet and one by census tract, we predicted that the fires are more likely to happen in the middle part of the study area. By adding this layer into Google maps, users can avoid the fire occurring areas and find an optimized route to arrive at their destination more effectively and safely. In this model, we only predict the location of wildfires. Since we want to predict the wildfire risk by day, our model will update and run itself everyday with the changing historic or time data. For example, if we want to predict the wildfire risk for Dec. 30th, 2021, the model will use the wildfire history and weather data from Jan. 2nd to 2018 to Dec. 29, 2021. If we want to predict Dec. 30th, 2021, the model will use the wildfire history and weather data from Jan. 3rd, 2018 to Dec. 30th , 2021.

The fire probability generated by our model can better predict no fire occurrence, but it also has a fair generalizability in predicting fire occurrence, both of which can help the users avoid the fire area. To see the accuracy of our model, we used the confusion matrix method. The matrix of 0s and 1s shows the comparison between the observed and predicted instances of fire occurrence at the 41% threshold. Sensitivity, which is the ability to predict true positive outcomes is about 58%, and specificity, which means the ability to predict true negative outcome, is 96%. The result shows the model accomplishes the need.

For our next step, we are going to develop the application with Google and work on the time lag variables. Adding more time features to improve our model to predict more accurate prediction by time. In the long term, we will perform user testing and keep refining our model, and if our model works for the selected counties, we may promote this model to a larger geographic area. Also, we will develop customer services to collect fire data that can help us better predict the fire occurrence and improve our prediction in positive incidents, which may cause more harm when users pass the areas without any notice.