With climate change, wildfire has become more destructive to personal and property safety. There were more than one million fires that resulted in thousands of deaths and injuries in the U.S. And it is only for the year 2021. Among them, California is the most risky state. We are thinking, what if we can predict the wildfire to let people avoid the area in the first place. Well, in this model, we selected Tehama, Glenn, and Mendocino County in California as our testbed. We hope to partner with Google Maps to add a layer of wildfire prediction to the app. Basically, this app has two functions, alert, and route planning. We also hope to add the feature to the current route planning algorithm of google maps. The route planning algorithm will calculate the best route concerning the wildfire risks. For example, when users want to go to Paskenta, but there is a high probability that wildfire will hapMotivation: With climate change, wildfire has become more destructive to personal and property safety. In 2021, there were 1,353,500 fires resulting in 3,800 civilian deaths and 14,700 injuries in the United States. Among them, California is the most risky state. There were 6,994 fires that resulted in the burning of 2,569,459 acres of land, according to The Department of Forestry and Fire Protection.
Given those high numbers of loss of damages, we were thinking what if we can predict the risk level of wildfire to let people avoid the area or move/protect their properties before the fires happen. Therefore,we hope to partner with Google Maps (the map app that has the largest share of users in the map app market) to add a layer of wildfire prediction to the app. Basically, this app has two functions, alert, and route planning. For alert, when people use google maps, we want to inform them about the potential wildfire risk of their destinations by date. For route planning, we hope Google can integrate our wildfire prediction feature to their route planning algorithm.
In this model, we selected Tehama, Glenn, and Mendocino County in California as our testbed, and we gathered the data from the California Department of Forestry and Fire Protection. We used geographic data such as land cover, elevation, the nearest campground, as our location features, and time-related data such as weather and historic fire incidences as our time features. We used these as inputs to train our model and got the results of predicted wildfire location. Then, we used multiple methods to test the accuracy and generalizability of our model, and provided a cost-benefit analysis for Google Maps developers to review.
Ultimately, we would like to make a wildfire prediction layer to Google Maps to deliver wildfire risk notifications to Google Maps users. We hope our model can notify them to avoid potential wildfires at their destinations. In the longer term, we hope to test with more users, refine our model, and promote to a larger geographic area if it works well for the study area in California.
Note: As for now, because we want to predict the wildfire risk by day, our model will update and run itself everyday with the changing historic/time data. For example, if we want to predict the wildfire risk for Dec. 30th, 2021, the model will retrieve the wildfire history and weather data from Jan. 2nd to 2018 to Dec. 29, 2021. If we want to predict Dec. 30th, 2021, the model will use the wildfire history and weather data from Jan. 3rd, 2018 to Dec. 30th , 2021. The inputs of data will change every time in the model depending on the date the user wants to see the predicted result of. When developing the model with Google Maps, we hope to refine our time lag variables in the model to deliver a more time-sensitive and real-time prediction for our users. pen near the National Forest during the selected day. Although the fastest path to Paskenta will pass by the National Forest, the route planning algorithm will avoid this area and plan a new route for users’ safety.
We pull in the geography data of three counties in Northern California and create fishnets in cellsize of 2000.
#all cali boundaries (counties)
counties <- st_read("https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/arcgis/rest/services/California_County_Boundaries/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")
#socal counties
norcal <- counties %>%
filter(COUNTY_NAME != "Kern" & COUNTY_NAME != "Riverside" & COUNTY_NAME != "Los Angeles" & COUNTY_NAME != "Imperial" & COUNTY_NAME != "San Bernardino" & COUNTY_NAME != "Ventura" & COUNTY_NAME != "Orange" & COUNTY_NAME != "Santa Barbara" & COUNTY_NAME != "San Luis Obispo" & COUNTY_NAME != "San Diego") %>%
st_transform('ESRI:102242')
norcal2 <- norcal %>%
filter(COUNTY_NAME == "Mendocino" | COUNTY_NAME == "Glenn" | COUNTY_NAME == "Tehama" )%>%
st_transform('ESRI:102242')
# centroids of southern california counties
centroids_counties <- cbind(norcal2, st_coordinates(st_centroid(norcal2)))
# fishnet
fishnet <-
st_make_grid(norcal2,
cellsize = 2000,
square = TRUE) %>%
.[norcal2] %>% # fast way to select intersecting polygons
st_sf() %>%
dplyr::mutate(uniqueID = row_number())
As we can see from the graph, the area in the middle had a lot of fire incidents in the past 3 years.
perimeters <- st_read("https://opendata.arcgis.com/datasets/e3802d2abf8741a187e73a9db49d68fe_2.geojson")
peri1820 <- perimeters %>%
filter(YEAR_ > 2017)%>%
st_transform('ESRI:102242')
norcal_only <-
st_intersection(st_make_valid(norcal2),
peri1820)
ggplot()+
geom_sf(data=norcal2, fill="gray", color="white") +
geom_sf(data=norcal_only, fill="#f06813", color="#f06813", alpha=0.9) +
geom_text(data = centroids_counties, aes(X, Y, label = COUNTY_NAME), size = 3, fontface="bold")+
labs(title = "Fire Perimeters in Mendocino, Glenn, and Tehama, CA",
subtitle="2018-2021",
caption="Data Sources: CAL FIRE; California State Geoportal") +
mapTheme()
Then we divided the whole area by fishnet with a size of 2000 meters. Purple means the lowest occurrence, and yellow means the highest occurrence. We also mapped the burned and unburned land as the dependent variable, and as the map shows, the result corresponds to the fire incidence map.
fire_net <-
dplyr::select(peri1820) %>%
mutate(countFire = 1) %>%
aggregate(., fishnet, sum) %>%
dplyr::mutate(countFire = replace_na(countFire, 0),
uniqueID = 1:n(),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
fire_net <-
fire_net %>%
dplyr::mutate(burned = case_when(countFire == 0 ~ "no",
countFire != 0 ~ "yes"))
ggplot() +
geom_sf(data = fire_net, aes(fill = countFire), color = NA) +
scale_fill_viridis(option="C") +
labs(title = "Count of Fire for the fishnet") +
mapTheme()+ theme(legend.position='right')
ggplot() +
geom_sf(data = fire_net, aes(fill = burned), color = NA) +
labs(title = "Burned v Unburned for the fishnet") +
scale_fill_manual(values = palette2)+
mapTheme()
To find out the reasons for fires, we then tested with different variables.
We explore land cover because we think some kinds of vegetation species may result in higher risk of wildfire. Vegetation is the fuel to wildfires. From the map, we can see that evergreen forests and shrubs made the highest share of land cover among all land cover types, indicating ample fuel for the occurrence of wildfire in this area. We also conducted mode analysis to find which vegetation species made up the highest concentration of vegetation in the area by fishnet.
land_cover <- raster("D:/Teresa/508final/nlcd_2019_land_cover_l48_20210604/nlcd_2019_land_cover_l48_20210604.img")
# #reclass <- c(0,10,0,
# 10,11,11,
# 11,12,12,
# 12,20,0,
# 20,21,21,
# 21,22,22,
# 22,23,23,
# 23,24,24,
# 24,30,0,
# 30,31,31,
# 31,40,0,
# 40,41,41,
# 41,42,42,
# 42,43,43,
# 44,51,0,
# 51,52,52,
# 52,70,0,
# 70,71,71,
# 71,80,0,
# 80,81,81,
# 81,82,82,
# 82,89,0,
# 89,90,90,
# 90,94,0,
# 94,95,95,
# 95,255,0
# )
# class_matrix <- matrix(reclass, ncol = 3, byrow= TRUE)
#land_cover2 <- reclassify(land_cover, class_matrix)
#lc2 <- crop(y = norcal2 %>% st_transform(crs(land_cover)))
fishnet1 <- cbind(fire_net, exact_extract(land_cover, fishnet %>% st_transform(crs(land_cover)), 'majority'))
fishnet1 <- fishnet1 %>%
dplyr::rename(landcover = exact_extract.land_cover..fishnet.....st_transform.crs.land_cover....)
elevation <- raster("D:/Teresa/508final/elevation.tif")
#crop(y = norcal2 %>% st_transform(crs(elevation)))
fishnet2 <- cbind(fishnet1, exact_extract(elevation, fishnet %>% st_transform(crs(elevation)), 'mean'))
fishnet2 <- fishnet2 %>%
dplyr::rename(elevation = exact_extract.elevation..fishnet.....st_transform.crs.elevation....)
fishnet2 <- fishnet2%>%
dplyr::mutate(land_cover = case_when(landcover == 11 ~ "water",
landcover == 12 ~ "perennial snow",
landcover == 21 ~ "developed, open space",
landcover == 22 ~ "developed, low intensity",
landcover == 23 ~ "developed, median intensity",
landcover == 24 ~ "developed, high intensity",
landcover == 31 ~ "barren",
landcover == 41 ~ "deciduous foreest",
landcover == 42 ~ "evergreen forest",
landcover == 43 ~ "mixed forest",
landcover == 52 ~ "shrub",
landcover == 71 ~ "herbaceous",
landcover == 81 ~ "hay",
landcover == 82 ~ "cultivated crops",
landcover == 90 ~ "woody wetlands",
landcover == 95 ~ "emerging herbaceous wetlands")) %>%
dplyr::select(-landcover)
ggplot() +
geom_sf(data = fishnet2, aes(fill = land_cover), color = NA) +
labs(title = "Land cover for the fishnet") +
mapTheme()
We also consider the mean elevation as one of our features because different elevation slopes may cause different exposure levels of wind speed and humility. We gathered the data from the Multi-Resolution Land Characteristics (MRLC) consortium. Different exposure levels of these may cause different levels of wildfire risk. We found that the areas with highest elevation (2000 meters and up) are concentrated in the middle part of the study area, which corresponds to the result of the historic fire incidence map.
ggplot() +
geom_sf(data = fishnet2, aes(fill = elevation), color = NA) +
labs(title = "Mean Elevation for the fishnet") +
mapTheme()
For weather, we gathered the data from ASOS (Automated Surface Observing Systems) weather stations from 2018 to 2021 for Tehama, Glenn, and Mendocino County in California. We found that average temperature is higher and wind speed is faster in the northeastern part of the study area, but the average humidity is higher in Southwestern part. It brings a higher wildfire risk for the northeastern part of the area, because the dryness, faster wind speed, and high temperature are some of the most direct causes of wildfire ignition and spread.
weather <- read.csv("D:/Teresa/508final/weather.csv")
weather_annual <- weather %>% st_as_sf(coords = c("lon", "lat"), crs = 4236, agr="constant") %>% st_transform(st_crs("ESRI:102242"))
library(dplyr)
weather_annual <- weather_annual %>% dplyr::group_by(year, id) %>%
dplyr::summarize(mean_ann_tmpf = mean(mean_tmpf, na.rm = TRUE),
mean_ann_precipitation = mean(mean_precipitation, na.rm = TRUE),
mean_ann_humidity = mean(mean_humidity, na.rm = TRUE),
mean_ann_windspeed = mean(mean_wind_Speed, na.rm = TRUE))
#create new df with only precipitation variable
weather_precip <- weather_annual %>% dplyr::select(geometry, mean_ann_precipitation)
weather_precip.idw <- gstat::idw(mean_ann_precipitation ~ 1, weather_precip, newdata=fishnet2, idp=2.0)
weather_precip.idw <- weather_precip.idw %>% dplyr::rename(avg_precip_idw = var1.pred)
precip.idw <- weather_precip.idw %>% dplyr::select(avg_precip_idw, geometry)
# join to fishnet
fishnet2 <- st_join(fishnet2, precip.idw, join=st_equals)
library(dplyr)
#new df with only wind speed variable
weather_wind <- weather_annual %>% dplyr::select(geometry, mean_ann_windspeed)
weather_wind.idw <- gstat::idw(mean_ann_windspeed ~ 1, weather_wind, newdata=fishnet2, idp=2.0)
weather_wind.idw <- weather_wind.idw %>% dplyr::rename(avg_windspeed_idw = var1.pred)
avg_windspeed.idw <- weather_wind.idw %>% dplyr::select(avg_windspeed_idw, geometry)
# join to fishnet
fishnet2 <- st_join(fishnet2, avg_windspeed.idw, join=st_equals)
weather_temp <- weather_annual %>% dplyr::select(geometry, mean_ann_tmpf)
weather_temp.idw <- gstat::idw(mean_ann_tmpf ~ 1, weather_temp, newdata=fishnet2, idp=2.0)
weather_temp.idw <- weather_temp.idw %>% dplyr::rename(avg_temp_idw = var1.pred)
avg_temp.idw <- weather_temp.idw %>% dplyr::select(avg_temp_idw, geometry)
# join to fishnet
fishnet2 <- st_join(fishnet2, avg_temp.idw, join=st_equals)
library(dplyr)
weather_hum <- weather_annual %>% dplyr::select(geometry, mean_ann_humidity)
weather_hum.idw <- gstat::idw(mean_ann_humidity ~ 1, weather_hum, newdata=fishnet2, idp=2.0)
weather_hum.idw <- weather_hum.idw %>% dplyr::rename(avg_hum_idw = var1.pred)
avg_hum.idw <- weather_hum.idw %>% dplyr::select(avg_hum_idw, geometry)
# join to fishnet
fishnet2 <- st_join(fishnet2, avg_hum.idw, join=st_equals)
precip <- ggplot()+
geom_sf(data=weather_precip.idw, aes(fill=avg_precip_idw)) +
scale_fill_viridis(option="C") +
labs(title="Average Precipitation",
subtitle="2018-2020",
caption="Data source: NOAA",
fill="Inches")+
mapTheme() + theme(legend.position='right')
wind <- ggplot()+
geom_sf(data=weather_wind.idw, aes(fill=avg_windspeed_idw)) +
scale_fill_viridis(option="C") +
labs(title="Average Wind Speed",
subtitle="2018-2020",
caption="Data source: NOAA",
fill="Tenths m/sec")+
mapTheme() + theme(legend.position='right')
temp <- ggplot()+
geom_sf(data=weather_temp.idw, aes(fill=avg_temp_idw)) +
scale_fill_viridis(option="C") +
labs(title="Average Temperature",
subtitle="2018-2020",
caption="Data source: NOAA",
fill="Degrees (F)")+
mapTheme()+ theme(legend.position='right')
humidity <- ggplot()+
geom_sf(data=weather_hum.idw, aes(fill=avg_hum_idw)) +
scale_fill_viridis(option="C") +
labs(title="Average Relative Humidity",
subtitle="2018-2020",
caption="Data source: NOAA",
fill="Percentage")+
mapTheme()+ theme(legend.position='right')
grid.arrange(ncol=2, precip, wind, temp, humidity)
#burned_num
fishnet2 <-
fishnet2 %>%
dplyr::mutate(burned_num = ifelse(countFire == 0, 0, 1))
For Tree mortality, more dead trees mean more fuel for fires to burn. We gathered the data from the California Open Data Portal and the result shows that those dead trees are concentrated in the middle and eastern part of the study area.
tier1hazard_treemortality <- st_read("https://opendata.arcgis.com/datasets/a71a85136b0b414ea734fdfbe3d7674a_0.geojson") %>% st_transform(st_crs(fishnet2))
fishnet2 <- fishnet2 %>% mutate(n_tier1hazard_int = lengths(st_intersects(fishnet2, tier1hazard_treemortality)))
ggplot()+
geom_sf(data=fishnet2, aes(fill=n_tier1hazard_int)) +
scale_fill_viridis(option="T") +
labs(title="Tree Mortality Hazard Classification",
fill="Amount",
subtitle="Per Fishnet Cell",
caption="Data source: California Open Data Portal")+
mapTheme()
We also computed the distance to the nearest campground by using the knn function, because many of the wildfire cases were caused by campfires. Campgrounds are more likely to have campfires. We gathered the data from California Open Data Portal and we found that the campground spread across the study area.
#read in camp grounds
camp_ground <- st_read("D:/Teresa/508final/campground/Campgrounds.shp") %>%
dplyr::select(geometry) %>%
st_transform(st_crs(fishnet2))
# calculate distance to nearest camp ground
fishnet2<-
fishnet2 %>%
mutate(dist_camp_ground=st_nn(st_centroid(fishnet2), camp_ground, 1))
fishnet2$dist_camp_ground <- as.numeric(fishnet2$dist_camp_ground)
ggplot()+
geom_sf(data=fishnet2, aes(fill=dist_camp_ground)) +
scale_fill_viridis(option="B") +
labs(title="Distance to Nearest Camp Grounds",
fill="Distance \n(meters)",
subtitle="Per Fishnet Cell",
caption="Data source: California Open Data Portal")+
mapTheme()
We also explored the nearest electric transmission lines, electrical substation, and nearest power plants. We gathered the data from the California Open Data Portal. The high voltage electricity in those lines, plants, and stations may cause wildfire if it lacks maintenance. There will also be potential flammable liquid leaks and malfunctions at those stations and plants. We should also pay special attention to the high voltage electric transmission lines near the forests. As the maps show, the darker the cells, the closer the distance. And we can see it is concentrated in the Northeastern part of the study area, making it more vulnerable to the risk of wildfire.
#read in electric transmission
electric_transmission <- st_read("D:/Teresa/508final/California_Electric_Transmission_Lines.geojson") %>%
dplyr::select(geometry) %>%
na.omit() %>%
st_transform(st_crs(fishnet2))
# calculate distance to nearest electric transmission
fishnet2<-
fishnet2 %>%
mutate(dist_elec_tran=st_nn(st_centroid(fishnet2), electric_transmission, 1))
fishnet2$dist_elec_tran <- as.numeric(fishnet2$dist_elec_tran)
ggplot()+
geom_sf(data=fishnet2, aes(fill=dist_elec_tran)) +
scale_fill_viridis(option="E") +
labs(title="Distance to Nearest Electirc Transmission",
fill="Distance \n(meters)",
subtitle="Per Fishnet Cell",
caption="Data source: California Open Data Portal")+
mapTheme()
#read in electric substations
electric_substations <- st_read("https://opendata.arcgis.com/datasets/7f37f2535d3144e898a53b9385737ee0_0.geojson") %>%
dplyr::select(Y = Lat, X = Lon) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet2))
# calculate distance to nearest electric substation
fishnet2<-
fishnet2 %>%
mutate(dist_elec_substation=st_nn(st_centroid(fishnet2), electric_substations, 1))
fishnet2$dist_elec_substation <- as.numeric(fishnet2$dist_elec_substation)
ggplot()+
geom_sf(data=fishnet2, aes(fill=dist_elec_substation)) +
scale_fill_viridis(option="E") +
labs(title="Distance to Nearest Electrical Substation",
fill="Distance \n(meters)",
subtitle="Per Fishnet Cell",
caption="Data source: California Open Data Portal")+
mapTheme()
fire_facilities <- st_read("https://gis.data.cnra.ca.gov/datasets/CALFIRE-Forestry::cal-fire-facilities-for-wildland-fire-protection.geojson?outSR=%7B%22latestWkid%22%3A3857%2C%22wkid%22%3A102100%7D") %>%
dplyr::select(Y = LAT, X = LON) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_zm()%>%
st_transform(st_crs(fishnet2))
# calculate distance to nearest facility
fishnet2<-
fishnet2 %>%
mutate(dist_fire_facilities=st_nn(st_centroid(fishnet2), fire_facilities, 1))
fishnet2$dist_fire_facilities <- as.numeric(fishnet2$dist_fire_facilities)
power_plants <- st_read("https://services3.arcgis.com/bWPjFyq029ChCGur/arcgis/rest/services/Power_Plant/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson") %>%
dplyr::select(geometry) %>%
na.omit() %>%
st_transform(st_crs(fishnet2))
# calculate distance to nearest power plants
fishnet2<-
fishnet2 %>%
mutate(dist_power_plants=st_nn(st_centroid(fishnet2), power_plants, 1))
fishnet2$dist_power_plants <- as.numeric(fishnet2$dist_power_plants)
ggplot()+
geom_sf(data=fishnet2, aes(fill=dist_power_plants)) +
scale_fill_viridis(option="E") +
labs(title="Distance to Nearest Power Plants",
fill="Distance \n(meters)",
subtitle="Per Fishnet Cell",
caption="Data source: California Open Data Portal")+
mapTheme()
We also calculate the distance from each cell’s centroid to the nearest fire facility. The data is from the California Open Data Portal. We can see that there is a disproportionate distribution of fire facilities in that the northern part of the study area is closer to the contracted fire facilities than the southern part of the study area. It does not correlate much with the historic fire incidence map.
ggplot()+
geom_sf(data=fishnet2, aes(fill=dist_fire_facilities)) +
scale_fill_viridis(option="E") +
labs(title="Distance to Nearest Fire Facilities",
fill="Distance \n(meters)",
subtitle="Per Fishnet Cell",
caption="Data source: California Open Data Portal")+
mapTheme()
#write.csv(fishnet2, "D:/Teresa/508final/fishnet2.csv")
In this section, we use a logistic regression model which can predict a binary outcome, along with the variables and the new features. It is a well-suited model for our case of predicting the probability that a fire will occur. First, the data is divided into 75% of training datasets and 25% of testing datasets, and the model runs on the training set by using createDataPartition. Then, the regression model used all the variables including the engineered features that help improve the model. As a result, the model has the lowest AIC, which shows a better fit.
set.seed(3456)
trainIndex <- createDataPartition(fishnet2$burned,
p = .75,
list = FALSE,
times = 1)
fireTrain <- fishnet2[ trainIndex,]
fireTest <- fishnet2[-trainIndex,]
reg1<- glm(burned_num ~ .,
data=st_drop_geometry(fireTrain) %>% dplyr::select(-uniqueID, -burned, -countFire, -cvID, -land_cover, -dist_elec_tran), family="binomial")
summary(reg1)
##
## Call:
## glm(formula = burned_num ~ ., family = "binomial", data = st_drop_geometry(fireTrain) %>%
## dplyr::select(-uniqueID, -burned, -countFire, -cvID, -land_cover,
## -dist_elec_tran))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1772 -0.5114 -0.3179 -0.1380 2.9709
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.457e+01 5.045e+00 4.871 1.11e-06 ***
## elevation 3.635e-03 1.359e-04 26.759 < 2e-16 ***
## avg_precip_idw -6.639e+02 1.204e+02 -5.515 3.49e-08 ***
## avg_windspeed_idw NA NA NA NA
## avg_temp_idw NA NA NA NA
## avg_hum_idw NA NA NA NA
## n_tier1hazard_int -9.250e-02 2.006e-02 -4.611 4.01e-06 ***
## dist_camp_ground 1.620e-03 5.473e-04 2.960 0.00307 **
## dist_elec_substation -8.112e-04 5.557e-05 -14.599 < 2e-16 ***
## dist_fire_facilities 1.786e-03 1.665e-04 10.728 < 2e-16 ***
## dist_power_plants -1.218e-02 2.268e-03 -5.371 7.82e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4204.5 on 4003 degrees of freedom
## Residual deviance: 2583.2 on 3996 degrees of freedom
## AIC: 2599.2
##
## Number of Fisher Scoring iterations: 6
In this section, to ensure the generalizability of the model, which needs to have a great performance on new data and predict accurately, we analyze the goodness of fits. First, using fit metrics, we can observe the McFadden R-Squared, which suggests that the higher, the better. The following calculation shows that the regression model has the best fit. The results of the model for the training data are included below.
pR2(reg1)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -1291.5796086 -2102.2721588 1621.3851004 0.3856268 0.3329841
## r2CU
## 0.5122104
testProbs <- data.frame(Outcome = as.factor(fireTest$burned_num),
Probs = predict(reg1, fireTest, type= "response"))
testProbs <-
testProbs %>%
mutate(predOutcome = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))
In this part, I use confusion metrics to better see how the prediction performs. The confusion matrix and its statistics for the regression are shown below. The matrix shows the comparison between observed and predicted burned and unburned area with the fire occurrence is 50% threshold. The accuracy is 88%. Sensitivity, which is the ability to predict true positive outcomes, is 58%. Specificity, which is the ability to predict true negative outcomes, is 96%. This shows that our model can better predict absence of fire and actually no fire occurs. But still, prediction of fire truly occurring is still a moderate prediction in our model.
xtab.reg1 <-caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome,
positive = "1")
as.matrix(xtab.reg1) %>% kable(caption = "Confusion Matrix") %>% kable_styling("striped", full_width = T, font_size = 14, position = "left")
0 | 1 | |
---|---|---|
0 | 1004 | 122 |
1 | 39 | 169 |
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1004 122
## 1 39 169
##
## Accuracy : 0.8793
## 95% CI : (0.8606, 0.8963)
## No Information Rate : 0.7819
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6056
##
## Mcnemar's Test P-Value : 1.03e-10
##
## Sensitivity : 0.5808
## Specificity : 0.9626
## Pos Pred Value : 0.8125
## Neg Pred Value : 0.8917
## Prevalence : 0.2181
## Detection Rate : 0.1267
## Detection Prevalence : 0.1559
## Balanced Accuracy : 0.7717
##
## 'Positive' Class : 1
##
as.matrix(xtab.reg1, what="classes") %>% kable(caption = "Confusion Matrix - Statistics") %>% kable_styling(font_size = 14, full_width = T,
bootstrap_options = c("striped", "hover"))
Sensitivity | 0.5807560 |
Specificity | 0.9626079 |
Pos Pred Value | 0.8125000 |
Neg Pred Value | 0.8916519 |
Precision | 0.8125000 |
Recall | 0.5807560 |
F1 | 0.6773547 |
Prevalence | 0.2181409 |
Detection Rate | 0.1266867 |
Detection Prevalence | 0.1559220 |
Balanced Accuracy | 0.7716819 |
Below shows a distribution of outcome of the predicted probabilities in our model. The model can better predict the negatives(no fire occurs) than the positives(fire occurs).
ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
scale_fill_manual(values = palette2) + xlim(0, 1) +
labs(x = "Fire Occurred or Not", y = "Density of Probabilities",
title = "Distribution of Predicted Probabilities") +
plotTheme() + theme(strip.text.x = element_text(size = 18),
legend.position = "none")
By plotting the ROC curve, we can see that if it includes a larger area under the curve, it is a better model. As I plotted below, the model includes an area under the curve by 0.8773, which indicates itself a fairly useful fit. Note that the points on the ROC curve means the model will predict true positive while predicting false positive correctly under the threshold.
ggplot(testProbs, aes(d = as.numeric(Outcome), m = Probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve") + plotTheme()
auc(testProbs$Outcome, testProbs$Probs)
## Area under the curve: 0.8773
Using cross validation to repeat the process to see the performance of the regression models can ensure that the model generalizes well across contexts. A k-fold cross validation is performed in this section. The model has a ROC of 89% and we can still observe that the model can better predict the true negative.
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
# str(fishnet2$burned)
# fishnet2$burned <- as.factor(fishnet2$burned)
# fishnet2$burned <- fct_rev(fishnet2$burned)
cvFit2 <- train(burned ~ .,
data=st_drop_geometry(fishnet2) %>%
dplyr::select(-uniqueID, -burned_num, -countFire, -dist_elec_tran),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit2
Using cross validation shows that our model perform better when predicting negatives thant predicting positive. It is quite fair results since our use case can also benefit from predincting true negative and shows that there will be no fire for users to pass through the area.
print(cvFit2) %>% kable(caption="Cross-Validation Results") %>% kable_styling(font_size = 14, full_width = T,bootstrap_options = c("striped", "hover"))
## Generalized Linear Model
##
## 5338 samples
## 12 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 5285, 5284, 5284, 5284, 5285, 5284, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8900086 0.9588444 0.5990909
ROC | Sens | Spec | |
---|---|---|---|
0.8900086 | 0.9588444 | 0.5990909 |
The faceted plots below shows cross-validation goodness of fit metrics, which include the mean for area under the curve (“ROC”), sensitivity (“Sens”), and specificity (“Spec”) across all of the 100 folds.(the data in of Sens and Spec are reversed, further fix needed)
We look to the distributions for each of the goodness of fit metrics to evaluate the generalizability of the model. The model has a great generalizability in specificity since the tighter distributions around means indicate great generalizability. The prediction of no fire occurring is well predicted. However, the prediction of fire occurrene has a fair generalizability.
dplyr::select(cvFit2$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit2$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#f7b61f") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "red", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="Cross-Validation Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines")
Our model will not have perfect accuracy, and when determining the optimal threshold it’s important to consider the costs and benefits of inaccurately predicting data. Keep in mind that the ultimate use case for this data is to help inform where the fire occurs and pass through the potentially harmful area to avoid fire. Let’s make some approximations and assume the following: * Google maps earn 4.3 billion every year, and there are 143 million users every month. So, we assume that every user contributes $2.32 in revenue from Google maps. * Predictions are being made at the cell level * User will quit using the app when a prediction is wrong
The calculation equations are provided below: * True Positive: 2.32 * Count * True Negative: 2.32 * Count * False Positive: 0 * False Negative: ( -2.32) * Count
# cost_benefit_table <-
# testProbs %>%
# mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
# count(predOutcome, Outcome) %>%
# summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
# True_Positive = sum(n[predOutcome==1 & Outcome==1]),
# False_Negative = sum(n[predOutcome==0 & Outcome==1]),
# False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
# gather(Variable, Count) %>%
# mutate(Revenue =
# ifelse(Variable == "True_Negative", Count * 2.32,
# ifelse(Variable == "True_Positive",Count * 2.32,
# ifelse(Variable == "False_Negative", Count * 0,
# ifelse(Variable == "False_Positive", Count * (-2.32), 0))))) %>%
# bind_cols(data.frame(Description = c(
# "Predicted correctly fire will not occur and it does not happen. Users keep using the app.",
# "Predicted correctly fire will occur and it happens. Users keep using the app.",
# "Predicted incorrectly that fire will occur but the fire does not occur. Users might not notice and won’t change the user behavior.",
# "Predicted incorrectly that fire will not occur but the fire occurs. Users quit using the app.")))
#
# kable(cost_benefit_table,
# caption = "Cost/Benefit Table") %>% kable_styling()
It is important to identify the threshold that provides the greatest accuracy to best optimize our model for the use. We add up the counts of each prediction and calculate the percentage to define our threshold. The optimal threshold in this case is 0.41, which has the highest accuracy among all.
iterateThresholds <- function(data, group) {
group <- enquo(group)
x = .01
all_prediction <- data.frame()
while (x <= 1) {
this_prediction <-
testProbs %>%
mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
group_by(!!group) %>%
dplyr::count(predOutcome, Outcome) %>%
dplyr::summarize(TN_sum = sum(n[predOutcome==0 & Outcome==0]),
TP_sum = sum(n[predOutcome==1 & Outcome==1]),
FN_sum = sum(n[predOutcome==0 & Outcome==1]),
FP_sum = sum(n[predOutcome==1 & Outcome==0]),
total=sum(n)) %>%
mutate(True_Positive = TP_sum / total,
True_Negative = TN_sum / total,
False_Negative = FN_sum / total,
False_Positive = FP_sum / total,
Accuracy = (TP_sum + TN_sum) / total, Threshold = x)
all_prediction <- rbind(all_prediction, this_prediction)
x <- x + .01
}
return(all_prediction)
}
whichThreshold <- iterateThresholds(testProbs)
allThresholds<-kable(whichThreshold) %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)%>%
scroll_box(width = "100%", height = "500px")
allThresholds
sum_TN | sum_TP | sum_FN | sum_FP | total | True_Positive | True_Negative | False_Negative | False_Positive | Accuracy | Threshold |
---|---|---|---|---|---|---|---|---|---|---|
54 | 290 | 1 | 989 | 1334 | 0.2173913 | 0.0404798 | 0.0007496 | 0.7413793 | 0.2578711 | 0.01 |
133 | 286 | 5 | 910 | 1334 | 0.2143928 | 0.0997001 | 0.0037481 | 0.6821589 | 0.3140930 | 0.02 |
269 | 282 | 9 | 774 | 1334 | 0.2113943 | 0.2016492 | 0.0067466 | 0.5802099 | 0.4130435 | 0.03 |
339 | 281 | 10 | 704 | 1334 | 0.2106447 | 0.2541229 | 0.0074963 | 0.5277361 | 0.4647676 | 0.04 |
396 | 276 | 15 | 647 | 1334 | 0.2068966 | 0.2968516 | 0.0112444 | 0.4850075 | 0.5037481 | 0.05 |
443 | 273 | 18 | 600 | 1334 | 0.2046477 | 0.3320840 | 0.0134933 | 0.4497751 | 0.5367316 | 0.06 |
513 | 270 | 21 | 530 | 1334 | 0.2023988 | 0.3845577 | 0.0157421 | 0.3973013 | 0.5869565 | 0.07 |
567 | 265 | 26 | 476 | 1334 | 0.1986507 | 0.4250375 | 0.0194903 | 0.3568216 | 0.6236882 | 0.08 |
615 | 260 | 31 | 428 | 1334 | 0.1949025 | 0.4610195 | 0.0232384 | 0.3208396 | 0.6559220 | 0.09 |
657 | 256 | 35 | 386 | 1334 | 0.1919040 | 0.4925037 | 0.0262369 | 0.2893553 | 0.6844078 | 0.10 |
691 | 249 | 42 | 352 | 1334 | 0.1866567 | 0.5179910 | 0.0314843 | 0.2638681 | 0.7046477 | 0.11 |
723 | 246 | 45 | 320 | 1334 | 0.1844078 | 0.5419790 | 0.0337331 | 0.2398801 | 0.7263868 | 0.12 |
743 | 243 | 48 | 300 | 1334 | 0.1821589 | 0.5569715 | 0.0359820 | 0.2248876 | 0.7391304 | 0.13 |
758 | 242 | 49 | 285 | 1334 | 0.1814093 | 0.5682159 | 0.0367316 | 0.2136432 | 0.7496252 | 0.14 |
776 | 238 | 53 | 267 | 1334 | 0.1784108 | 0.5817091 | 0.0397301 | 0.2001499 | 0.7601199 | 0.15 |
793 | 238 | 53 | 250 | 1334 | 0.1784108 | 0.5944528 | 0.0397301 | 0.1874063 | 0.7728636 | 0.16 |
815 | 236 | 55 | 228 | 1334 | 0.1769115 | 0.6109445 | 0.0412294 | 0.1709145 | 0.7878561 | 0.17 |
833 | 235 | 56 | 210 | 1334 | 0.1761619 | 0.6244378 | 0.0419790 | 0.1574213 | 0.8005997 | 0.18 |
858 | 235 | 56 | 185 | 1334 | 0.1761619 | 0.6431784 | 0.0419790 | 0.1386807 | 0.8193403 | 0.19 |
866 | 232 | 59 | 177 | 1334 | 0.1739130 | 0.6491754 | 0.0442279 | 0.1326837 | 0.8230885 | 0.20 |
876 | 230 | 61 | 167 | 1334 | 0.1724138 | 0.6566717 | 0.0457271 | 0.1251874 | 0.8290855 | 0.21 |
893 | 228 | 63 | 150 | 1334 | 0.1709145 | 0.6694153 | 0.0472264 | 0.1124438 | 0.8403298 | 0.22 |
907 | 228 | 63 | 136 | 1334 | 0.1709145 | 0.6799100 | 0.0472264 | 0.1019490 | 0.8508246 | 0.23 |
914 | 223 | 68 | 129 | 1334 | 0.1671664 | 0.6851574 | 0.0509745 | 0.0967016 | 0.8523238 | 0.24 |
920 | 221 | 70 | 123 | 1334 | 0.1656672 | 0.6896552 | 0.0524738 | 0.0922039 | 0.8553223 | 0.25 |
926 | 221 | 70 | 117 | 1334 | 0.1656672 | 0.6941529 | 0.0524738 | 0.0877061 | 0.8598201 | 0.26 |
936 | 219 | 72 | 107 | 1334 | 0.1641679 | 0.7016492 | 0.0539730 | 0.0802099 | 0.8658171 | 0.27 |
942 | 217 | 74 | 101 | 1334 | 0.1626687 | 0.7061469 | 0.0554723 | 0.0757121 | 0.8688156 | 0.28 |
949 | 212 | 79 | 94 | 1334 | 0.1589205 | 0.7113943 | 0.0592204 | 0.0704648 | 0.8703148 | 0.29 |
961 | 208 | 83 | 82 | 1334 | 0.1559220 | 0.7203898 | 0.0622189 | 0.0614693 | 0.8763118 | 0.30 |
965 | 202 | 89 | 78 | 1334 | 0.1514243 | 0.7233883 | 0.0667166 | 0.0584708 | 0.8748126 | 0.31 |
967 | 200 | 91 | 76 | 1334 | 0.1499250 | 0.7248876 | 0.0682159 | 0.0569715 | 0.8748126 | 0.32 |
971 | 199 | 92 | 72 | 1334 | 0.1491754 | 0.7278861 | 0.0689655 | 0.0539730 | 0.8770615 | 0.33 |
972 | 196 | 95 | 71 | 1334 | 0.1469265 | 0.7286357 | 0.0712144 | 0.0532234 | 0.8755622 | 0.34 |
977 | 193 | 98 | 66 | 1334 | 0.1446777 | 0.7323838 | 0.0734633 | 0.0494753 | 0.8770615 | 0.35 |
979 | 192 | 99 | 64 | 1334 | 0.1439280 | 0.7338831 | 0.0742129 | 0.0479760 | 0.8778111 | 0.36 |
981 | 189 | 102 | 62 | 1334 | 0.1416792 | 0.7353823 | 0.0764618 | 0.0464768 | 0.8770615 | 0.37 |
985 | 187 | 104 | 58 | 1334 | 0.1401799 | 0.7383808 | 0.0779610 | 0.0434783 | 0.8785607 | 0.38 |
987 | 186 | 105 | 56 | 1334 | 0.1394303 | 0.7398801 | 0.0787106 | 0.0419790 | 0.8793103 | 0.39 |
989 | 184 | 107 | 54 | 1334 | 0.1379310 | 0.7413793 | 0.0802099 | 0.0404798 | 0.8793103 | 0.40 |
991 | 183 | 108 | 52 | 1334 | 0.1371814 | 0.7428786 | 0.0809595 | 0.0389805 | 0.8800600 | 0.41 |
992 | 181 | 110 | 51 | 1334 | 0.1356822 | 0.7436282 | 0.0824588 | 0.0382309 | 0.8793103 | 0.42 |
993 | 179 | 112 | 50 | 1334 | 0.1341829 | 0.7443778 | 0.0839580 | 0.0374813 | 0.8785607 | 0.43 |
994 | 178 | 113 | 49 | 1334 | 0.1334333 | 0.7451274 | 0.0847076 | 0.0367316 | 0.8785607 | 0.44 |
995 | 177 | 114 | 48 | 1334 | 0.1326837 | 0.7458771 | 0.0854573 | 0.0359820 | 0.8785607 | 0.45 |
996 | 176 | 115 | 47 | 1334 | 0.1319340 | 0.7466267 | 0.0862069 | 0.0352324 | 0.8785607 | 0.46 |
996 | 174 | 117 | 47 | 1334 | 0.1304348 | 0.7466267 | 0.0877061 | 0.0352324 | 0.8770615 | 0.47 |
1000 | 171 | 120 | 43 | 1334 | 0.1281859 | 0.7496252 | 0.0899550 | 0.0322339 | 0.8778111 | 0.48 |
1001 | 169 | 122 | 42 | 1334 | 0.1266867 | 0.7503748 | 0.0914543 | 0.0314843 | 0.8770615 | 0.49 |
1004 | 169 | 122 | 39 | 1334 | 0.1266867 | 0.7526237 | 0.0914543 | 0.0292354 | 0.8793103 | 0.50 |
1004 | 165 | 126 | 39 | 1334 | 0.1236882 | 0.7526237 | 0.0944528 | 0.0292354 | 0.8763118 | 0.51 |
1006 | 164 | 127 | 37 | 1334 | 0.1229385 | 0.7541229 | 0.0952024 | 0.0277361 | 0.8770615 | 0.52 |
1006 | 162 | 129 | 37 | 1334 | 0.1214393 | 0.7541229 | 0.0967016 | 0.0277361 | 0.8755622 | 0.53 |
1006 | 159 | 132 | 37 | 1334 | 0.1191904 | 0.7541229 | 0.0989505 | 0.0277361 | 0.8733133 | 0.54 |
1007 | 157 | 134 | 36 | 1334 | 0.1176912 | 0.7548726 | 0.1004498 | 0.0269865 | 0.8725637 | 0.55 |
1008 | 154 | 137 | 35 | 1334 | 0.1154423 | 0.7556222 | 0.1026987 | 0.0262369 | 0.8710645 | 0.56 |
1008 | 151 | 140 | 35 | 1334 | 0.1131934 | 0.7556222 | 0.1049475 | 0.0262369 | 0.8688156 | 0.57 |
1009 | 149 | 142 | 34 | 1334 | 0.1116942 | 0.7563718 | 0.1064468 | 0.0254873 | 0.8680660 | 0.58 |
1009 | 147 | 144 | 34 | 1334 | 0.1101949 | 0.7563718 | 0.1079460 | 0.0254873 | 0.8665667 | 0.59 |
1009 | 146 | 145 | 34 | 1334 | 0.1094453 | 0.7563718 | 0.1086957 | 0.0254873 | 0.8658171 | 0.60 |
1010 | 143 | 148 | 33 | 1334 | 0.1071964 | 0.7571214 | 0.1109445 | 0.0247376 | 0.8643178 | 0.61 |
1012 | 139 | 152 | 31 | 1334 | 0.1041979 | 0.7586207 | 0.1139430 | 0.0232384 | 0.8628186 | 0.62 |
1013 | 137 | 154 | 30 | 1334 | 0.1026987 | 0.7593703 | 0.1154423 | 0.0224888 | 0.8620690 | 0.63 |
1014 | 135 | 156 | 29 | 1334 | 0.1011994 | 0.7601199 | 0.1169415 | 0.0217391 | 0.8613193 | 0.64 |
1016 | 132 | 159 | 27 | 1334 | 0.0989505 | 0.7616192 | 0.1191904 | 0.0202399 | 0.8605697 | 0.65 |
1016 | 130 | 161 | 27 | 1334 | 0.0974513 | 0.7616192 | 0.1206897 | 0.0202399 | 0.8590705 | 0.66 |
1016 | 129 | 162 | 27 | 1334 | 0.0967016 | 0.7616192 | 0.1214393 | 0.0202399 | 0.8583208 | 0.67 |
1017 | 126 | 165 | 26 | 1334 | 0.0944528 | 0.7623688 | 0.1236882 | 0.0194903 | 0.8568216 | 0.68 |
1017 | 125 | 166 | 26 | 1334 | 0.0937031 | 0.7623688 | 0.1244378 | 0.0194903 | 0.8560720 | 0.69 |
1019 | 121 | 170 | 24 | 1334 | 0.0907046 | 0.7638681 | 0.1274363 | 0.0179910 | 0.8545727 | 0.70 |
1020 | 120 | 171 | 23 | 1334 | 0.0899550 | 0.7646177 | 0.1281859 | 0.0172414 | 0.8545727 | 0.71 |
1021 | 118 | 173 | 22 | 1334 | 0.0884558 | 0.7653673 | 0.1296852 | 0.0164918 | 0.8538231 | 0.72 |
1022 | 114 | 177 | 21 | 1334 | 0.0854573 | 0.7661169 | 0.1326837 | 0.0157421 | 0.8515742 | 0.73 |
1022 | 112 | 179 | 21 | 1334 | 0.0839580 | 0.7661169 | 0.1341829 | 0.0157421 | 0.8500750 | 0.74 |
1023 | 107 | 184 | 20 | 1334 | 0.0802099 | 0.7668666 | 0.1379310 | 0.0149925 | 0.8470765 | 0.75 |
1023 | 104 | 187 | 20 | 1334 | 0.0779610 | 0.7668666 | 0.1401799 | 0.0149925 | 0.8448276 | 0.76 |
1024 | 97 | 194 | 19 | 1334 | 0.0727136 | 0.7676162 | 0.1454273 | 0.0142429 | 0.8403298 | 0.77 |
1025 | 92 | 199 | 18 | 1334 | 0.0689655 | 0.7683658 | 0.1491754 | 0.0134933 | 0.8373313 | 0.78 |
1025 | 90 | 201 | 18 | 1334 | 0.0674663 | 0.7683658 | 0.1506747 | 0.0134933 | 0.8358321 | 0.79 |
1026 | 87 | 204 | 17 | 1334 | 0.0652174 | 0.7691154 | 0.1529235 | 0.0127436 | 0.8343328 | 0.80 |
1026 | 85 | 206 | 17 | 1334 | 0.0637181 | 0.7691154 | 0.1544228 | 0.0127436 | 0.8328336 | 0.81 |
1030 | 82 | 209 | 13 | 1334 | 0.0614693 | 0.7721139 | 0.1566717 | 0.0097451 | 0.8335832 | 0.82 |
1034 | 80 | 211 | 9 | 1334 | 0.0599700 | 0.7751124 | 0.1581709 | 0.0067466 | 0.8350825 | 0.83 |
1034 | 71 | 220 | 9 | 1334 | 0.0532234 | 0.7751124 | 0.1649175 | 0.0067466 | 0.8283358 | 0.84 |
1034 | 70 | 221 | 9 | 1334 | 0.0524738 | 0.7751124 | 0.1656672 | 0.0067466 | 0.8275862 | 0.85 |
1035 | 66 | 225 | 8 | 1334 | 0.0494753 | 0.7758621 | 0.1686657 | 0.0059970 | 0.8253373 | 0.86 |
1036 | 60 | 231 | 7 | 1334 | 0.0449775 | 0.7766117 | 0.1731634 | 0.0052474 | 0.8215892 | 0.87 |
1036 | 57 | 234 | 7 | 1334 | 0.0427286 | 0.7766117 | 0.1754123 | 0.0052474 | 0.8193403 | 0.88 |
1036 | 50 | 241 | 7 | 1334 | 0.0374813 | 0.7766117 | 0.1806597 | 0.0052474 | 0.8140930 | 0.89 |
1036 | 46 | 245 | 7 | 1334 | 0.0344828 | 0.7766117 | 0.1836582 | 0.0052474 | 0.8110945 | 0.90 |
1037 | 39 | 252 | 6 | 1334 | 0.0292354 | 0.7773613 | 0.1889055 | 0.0044978 | 0.8065967 | 0.91 |
1037 | 33 | 258 | 6 | 1334 | 0.0247376 | 0.7773613 | 0.1934033 | 0.0044978 | 0.8020990 | 0.92 |
1037 | 29 | 262 | 6 | 1334 | 0.0217391 | 0.7773613 | 0.1964018 | 0.0044978 | 0.7991004 | 0.93 |
1038 | 28 | 263 | 5 | 1334 | 0.0209895 | 0.7781109 | 0.1971514 | 0.0037481 | 0.7991004 | 0.94 |
1039 | 23 | 268 | 4 | 1334 | 0.0172414 | 0.7788606 | 0.2008996 | 0.0029985 | 0.7961019 | 0.95 |
1041 | 16 | 275 | 2 | 1334 | 0.0119940 | 0.7803598 | 0.2061469 | 0.0014993 | 0.7923538 | 0.96 |
1042 | 14 | 277 | 1 | 1334 | 0.0104948 | 0.7811094 | 0.2076462 | 0.0007496 | 0.7916042 | 0.97 |
1043 | 9 | 282 | 0 | 1334 | 0.0067466 | 0.7818591 | 0.2113943 | 0.0000000 | 0.7886057 | 0.98 |
1043 | 2 | 289 | 0 | 1334 | 0.0014993 | 0.7818591 | 0.2166417 | 0.0000000 | 0.7833583 | 0.99 |
whichThreshold <- iterateThresholds(testProbs)
whichThreshold %>%
ggplot(.,aes(Threshold, Accuracy)) +
geom_point() +
scale_colour_manual(values = paletteInferno5) +
labs(title = "Confusion Metric Outcome by Threshold",
y = "Count") +
guides(colour=guide_legend(title = "Legend")) +
plotTheme()
We join back the predicted probabilities to the fishnet by the unique ID of each cell. We also use weighting to select the grid cells value when overlapped in joining into census tracts.
predict_fishnet <-
data.frame(uniqueID = fishnet2$uniqueID,
Outcome = fishnet2$burned,
Probs = predict(reg1, fishnet2, type="response"))
predict_fishnet <-
predict_fishnet %>%
mutate(predOutcome = as.factor(ifelse(predict_fishnet$Probs > 0.41, 1, 0)))
fishnet_final_preds <- cbind(fishnet2,predict_fishnet)
ggplot() +
geom_sf(data = fishnet_final_preds, aes(fill = Outcome), color = NA) +
scale_fill_manual(values = palette2)+
geom_sf(data=norcal_only, fill= NA, color="#f06813", alpha=0.9) +
geom_text(data = centroids_counties, aes(X, Y, label = COUNTY_NAME), size = 3, fontface="bold")+
labs(title = "Prediction by fishnet cells") +
mapTheme()
census_api_key("16325a5baedf1c986a182c3321f90022be22845a", overwrite = TRUE)
tracts20 <-
get_acs(geography = "tract",
variables = c("B25026_001E"),
year=2020, state=06, geometry=TRUE) %>%
st_transform('ESRI:102242')
fishnet_final_preds_only <- fishnet_final_preds %>% dplyr::select(uniqueID, Probs, geometry)
interpolate_probs <-
dplyr::select(fishnet_final_preds, Probs) %>%
st_interpolate_aw(., tracts20, extensive = FALSE) %>% st_sf()
interpolation_tracts <- st_join(interpolate_probs, tracts20, join=st_equals)
fireProb_tracts <- interpolation_tracts %>% dplyr::select(GEOID, geometry, Probs) %>% st_sf()
fireProb_tracts_nogeom <- interpolation_tracts %>% dplyr::select(GEOID, Probs) %>% st_drop_geometry()
final_norcal_only <-
st_intersection(st_make_valid(norcal2),
fireProb_tracts)
#6. write.csv
ggplot() +
geom_sf(data = final_norcal_only, aes(fill = Probs), color = NA) +
labs(title = "Prediction by Census Tracts") +
mapTheme()
We are excited to use our model to better predict the wildfire occurrence in northern California. We ran our model and as you can see from the results maps, one by fishnet and one by census tract, we predicted that the fires are more likely to happen in the middle part of the study area. By adding this layer into Google maps, users can avoid the fire occurring areas and find an optimized route to arrive at their destination more effectively and safely. In this model, we only predict the location of wildfires. Since we want to predict the wildfire risk by day, our model will update and run itself everyday with the changing historic or time data. For example, if we want to predict the wildfire risk for Dec. 30th, 2021, the model will use the wildfire history and weather data from Jan. 2nd to 2018 to Dec. 29, 2021. If we want to predict Dec. 30th, 2021, the model will use the wildfire history and weather data from Jan. 3rd, 2018 to Dec. 30th , 2021.
The fire probability generated by our model can better predict no fire occurrence, but it also has a fair generalizability in predicting fire occurrence, both of which can help the users avoid the fire area. To see the accuracy of our model, we used the confusion matrix method. The matrix of 0s and 1s shows the comparison between the observed and predicted instances of fire occurrence at the 41% threshold. Sensitivity, which is the ability to predict true positive outcomes is about 58%, and specificity, which means the ability to predict true negative outcome, is 96%. The result shows the model accomplishes the need.
For our next step, we are going to develop the application with Google and work on the time lag variables. Adding more time features to improve our model to predict more accurate prediction by time. In the long term, we will perform user testing and keep refining our model, and if our model works for the selected counties, we may promote this model to a larger geographic area. Also, we will develop customer services to collect fire data that can help us better predict the fire occurrence and improve our prediction in positive incidents, which may cause more harm when users pass the areas without any notice.