Bike share has become more and more popular in big cities since it is convenient for us to rent bike for short commuting from point to point. However, it is crucial for share bikes stations to balance their bike demand. It is always a issue when either a share bike station is empty or it is no vacant docking to deposit a bike. Whenever these kinds of cases happen, both the bike riders and the company may suffer a lot. Therefore, it is important to come up with a re-balancing strategy to better allocate the share bike distribution and the prediction of a share bike demand either spatially or temporally. In this way, we can better predict the demand and distribute the share bike strategically and ensure the users to have a great experience exploring the city by using our convenient and smart share bike system.
In this report, I built a prediction model for bike sharing demand based on Philadelphia’s Indego bike share system. The model has a moderate accuracy and generalizability to predict the demand temporally and spatially, which can further utilize in the re-balancing plans for Indego to predict each stations demand two weeks earlier and make plans for redistributing the bikes from station to station. Also, I considered factors includes time lags, holiday lags(Independence day for instance), and amenity features, such as bus stops, schools, parks, and sanitation centers) to better our model.
Furthermore, I suggest that share bike company can redistribute the bikes by a small fleets of trucks to move bikes from station to station. However, share bike is a business model in circular economy, so I suggests that companies can reward user for helping allocate the bikes so that we can reduce our uses of trucks to move around the bikes. By doing so, the carbon emission can also be minimized.
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(pillar)
library(dplyr)
library(FNN)
library(ggplot2)
library(gganimate)
library(gapminder)
library(gifski)
library(caret)
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0",size=.2),
axis.ticks=element_blank())
mapTheme <- theme(plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
palette6 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c","909580")
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
# Install Census API Key
tidycensus::census_api_key("16325a5baedf1c986a182c3321f90022be22845a", overwrite = TRUE, install = TRUE)
I read in the indego trip data of July, 2018 using
read.socrata
. Also, I filter out the station id of 3000,
which means unknown station. The weather in Philadelphia in July has a
pleasant temperature range, so we may see some leisure trips as well as
commutes.
indego <- read.csv("/Users/teresachang/Downloads/indego-trips-2018-q3.csv")
indego <-
filter(indego,indego$end_station != 3000)
I use date parsing to bin the data by 15 and 60 minute intervals by
rounding. I extract the week
of the observation, ranging
from 1-52 throughout the year, and the dotw
for day of the
week. And I only keep the data of the week 26 to 31 in this report.
indego <-
indego %>%
mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
indego <-
indego %>%
filter(.,indego$week < 32)
#glimpse(indego)
Using the tidycensus
package, we can download census
geography and variables. I extract the tracts for mapping and joining
purposes.
phillyCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2018,
state = "PA",
geometry = TRUE,
county=c("Philadelphia"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
phillyTracts <-
phillyCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf
I add amenity features of Septa bus stops, schools, park and recreation center, and sanitation convenience centers obtained from OpenDataPhilly https://www.opendataphilly.org/dataset. Then, I use the method of Nearest neighbor to add the features for Septa bus stops, schools, park and recreation center, and sanitation convenience centers as predictors to be put into the regression models.
busstops <- read_csv("/Users/teresachang/Downloads/shelter_locations.csv")%>%
st_as_sf(., coords = c("LONG.", "LAT."), crs = 4326)
## Rows: 297 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): STREET, INTERSECTI, ADDRESS
## dbl (2): LONG., LAT.
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
schools <- st_read("https://opendata.arcgis.com/datasets/d46a7e59e2c246c891fbee778759717e_0.geojson") %>%
st_set_crs('4326')
## Reading layer `Schools' from data source
## `https://opendata.arcgis.com/datasets/d46a7e59e2c246c891fbee778759717e_0.geojson'
## using driver `GeoJSON'
## Simple feature collection with 490 features and 15 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.26633 ymin: 39.90781 xmax: -74.97053 ymax: 40.12974
## Geodetic CRS: WGS 84
PPR <- st_read("/Users/teresachang/Downloads/PPR_Program_Sites.geojson")%>%
st_set_crs('4326')
## Reading layer `PPR_Program_Sites' from data source
## `/Users/teresachang/Downloads/PPR_Program_Sites.geojson' using driver `GeoJSON'
## Simple feature collection with 224 features and 10 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.26015 ymin: 39.89718 xmax: -74.96944 ymax: 40.12284
## Geodetic CRS: WGS 84
sanitation <- st_read("https://opendata.arcgis.com/datasets/129e346cc7654079980860b0656587a5_0.geojson") %>%
st_set_crs('4326')
## Reading layer `Sanitation_Convenience_Centers' from data source
## `https://opendata.arcgis.com/datasets/129e346cc7654079980860b0656587a5_0.geojson'
## using driver `GeoJSON'
## Simple feature collection with 6 features and 4 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.24008 ymin: 39.91952 xmax: -75.01131 ymax: 40.03858
## Geodetic CRS: WGS 84
startstop <-
indego %>%
filter(is.na(start_lon) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lon) == FALSE &
is.na(end_lat) == FALSE) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)
indego <-
indego %>%
filter(is.na(start_lon) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lon) == FALSE &
is.na(end_lat) == FALSE) %>%
mutate(
schools_nn3 = nn_function(na.omit(st_coordinates(startstop)), na.omit(st_coordinates(schools)), 3),
busstops_nn3 = nn_function(na.omit(st_coordinates(startstop)), na.omit(st_coordinates(busstops)), 3),
PPR_nn3 = nn_function(na.omit(st_coordinates(startstop)), na.omit(st_coordinates(PPR)), 3),
sanitation_nn3 = nn_function(na.omit(st_coordinates(startstop)), na.omit(st_coordinates(sanitation)), 3)
)
I add the spatial information to our bike share data as origin and destination data, first joining the origin station, then the destination station to our census data. However, I don’t use the destination data in this report.
bike_census <- st_join(indego %>%
filter(is.na(start_lon) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lon) == FALSE &
is.na(end_lat) == FALSE) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
phillyTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(start_lon = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry)%>%
st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
st_join(., phillyTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(end_lon = unlist(map(geometry, 1)),
end_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry)
Import weather data from Philadelphia International Airport (code
KPHL) using riem_measures
from 2018/7/1 to 2018/8/5, and get
temperature, wind speed, precipitation on an hourly basis and plot the
temperature and precipitation trends over our study period. I assume
that while raining days and high temperature days, people tend to not
ride bikes.
weather.Panel <-
riem_measures(station = "KPHL", date_start = "2018-07-01", date_end = "2018-08-05") %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
#glimpse(weather.Panel)
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
top="Weather Data - Philadelphia KPHL - 7/1 to 8/5, 2018")
I examine the time and frequency components of our data. First, we look at the overall time pattern. Note that it is Independence Day on 7/4, which means that holidays may influence the normal pattern in weekday.
ggplot(bike_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n))+
labs(title="Bike share trips per hr. Philadelphia, 7/1 to 8/5, 2018",
x="Date",
y="Number of trips")+
plotTheme
I examine the distribution of trip volume by station for different times of the day. The mean number of trips each station shows different patterns during am rush hours, pm rush hours, mid-day, and overnight. It is obvious that mid-day has the highest frequency in 2 trips overall. Also, PM ride has the largest number of rides by 8, which means people are more likely to ride a bike in the pm rush hours.
bike_census %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(interval60, start_station, time_of_day) %>%
tally()%>%
group_by(start_station, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 1)+
labs(title="Mean Number of Hourly Trips Per Station. Philadelphia, 7/1 to 8/5, 2018",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
plotTheme
Histogram of bike share trips per hour by station is made. We can see that sost stations have less than 20 bike rides per hour.
ggplot(bike_census %>%
group_by(interval60, start_station) %>%
tally())+
geom_histogram(aes(n), binwidth = 5)+
labs(title="Bike share trips per hr by station. Philadelphia, 7/1 to 8/5, 2018",
x="Trip Counts",
y="Number of Stations")+
plotTheme
The plot of bike share trips distribution with time, sorted by weekdays and weekends is made. It can be observe that the peaks occur in am and pm rush hours in weekdays. However, it shows only one peak in weekends, which is around 3 pm when many people might go out in the afternoon for some activities.
ggplot(bike_census %>% mutate(hour = hour(start_time)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
labs(title="Bike share trips in Philadelphia, by day of the week, 7/1 to 8/5, 2018",
x="Hour",
y="Trip Counts")+
plotTheme
ggplot(bike_census %>%
mutate(hour = hour(start_time),
weekend = ifelse(dotw %in% c("日", "六"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
labs(title="Bike share trips in Philadelphia - weekend vs weekday, 7/1 to 8/5, 2018",
x="Hour",
y="Trip Counts")+
plotTheme
The origin map for the bike trips in Philadelphia is made below. The maps are sorted by weekdays/weekends and different time categories of the day. It is obvious that more trips happened in weekdays than weekends. Also, there are more trips in pm rush hours. Besides, trips mostly happens in the center city of Philadelphia.
ggplot()+
geom_sf(data = phillyTracts %>%
st_transform(crs=4326))+
geom_point(data = bike_census %>%
mutate(hour = hour(start_time),
weekend = ifelse(dotw %in% c("日", "六"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
tally(),
aes(x=start_lon, y = start_lat, color = n),
fill = "transparent", alpha = 0.4, size = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
facet_grid(weekend ~ time_of_day)+
labs(title="Bike share trips per hr by station. Philadelphia, 7/1 to 8/5, 2018")+
mapTheme
A space-time panel is created for future regression model. First, I make sure each unique station and hour/day combo exists in our data set, which is done in order to create a panel data set where each time period in the study is represented by a row. If a station didn’t have any trips originating from it at a given hour, I add a zero in that spot in the panel. Besides, all other features needed for model training are also involved in the space-time panel.
#length(unique(bike_census$interval60)) * length(unique(bike_census$start_station))
study.panel <-
expand.grid(interval60=unique(bike_census$interval60),
start_station = unique(bike_census$start_station)) %>%
left_join(., bike_census %>%
select(start_station, Origin.Tract, start_lon, start_lat, busstops_nn3, schools_nn3, PPR_nn3,
sanitation_nn3)%>%
distinct() %>%
group_by(start_station) %>%
slice(1))
#nrow(study.panel)
ride.panel <-
bike_census %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, start_station, Origin.Tract, start_lon, start_lat, busstops_nn3, schools_nn3, PPR_nn3,
sanitation_nn3) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
ungroup() %>%
filter(is.na(start_station) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE)
ride.panel <-
left_join(ride.panel, phillyCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
I add additional nuance about the demand during a given time period by creating time lag variables. Time lags of 1 hour, 2 hours, 3 hours, 4 hours, 12 hours, and 1 day are created in the panel. Also, I try to control for the effects of holidays that disrupt the expected demand during a given weekend or weekday. There is a holiday on July 4th, Independence Day. For that five day weekend I use some variables indicating temporal proximity to the holiday.
ride.panel <-
ride.panel %>%
arrange(start_station, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 185,1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))
The table provided below is the correlations in these lags. We can see that 1 hour lag and 1 day lag have the top 2 largest positive correlation with the bike share data while the 12 hour lag has the largest negative correlation. This means the bike share demand has a high similarity with the near future and the next day. Also, it is totally different in the same station after 12 hours since people user scenario and behavior may change a lot after 12 hours.
as.data.frame(ride.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day","holidayLag")))%>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
## # A tibble: 6 × 2
## Variable correlation
## <fct> <dbl>
## 1 lagHour 0.81
## 2 lag2Hours 0.55
## 3 lag3Hours 0.33
## 4 lag4Hours 0.14
## 5 lag12Hours -0.35
## 6 lag1day 0.76
To better illustrate the change of numbers of bike trips each station with time, an animated map is generated. Trip count is of the 60-minute intervals, and a whole week of week 27 is put into the map.
stations<-
indego %>%
dplyr::select(start_station, start_lon, start_lat) %>%
group_by(start_station) %>%
distinct()%>%
filter(is.na(start_lon) == FALSE &
is.na(start_lat) == FALSE) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)
week26 <-
filter(bike_census , week == 27)
week26.panel <-
expand.grid(
interval60 = unique(week26$interval60),
Origin.Tract = unique(ride.panel$Origin.Tract))
bike.animation.data <-
mutate(week26, Trip_Counter = 1) %>%
right_join(week26) %>%
group_by(interval60, start_station) %>%
summarise(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
ungroup() %>%
left_join(stations, by=c("start_station" = "start_station")) %>%
st_sf() %>%
mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
Trip_Count == 1 ~ "1 trips",
Trip_Count == 2 ~ "2 trips",
Trip_Count > 2 & Trip_Count <= 4 ~ "3-4 trips",
Trip_Count > 4 ~ "5+ trips")) %>%
mutate(Trips = fct_relevel(Trips, "0 trips","1 trips","2 trips",
"3-4 trips","5+ trips"))
## Joining, by = c("trip_id", "duration", "start_time", "end_time",
## "start_station", "end_station", "bike_id", "plan_duration",
## "trip_route_category", "passholder_type", "bike_type", "interval60",
## "interval15", "week", "dotw", "schools_nn3", "busstops_nn3", "PPR_nn3",
## "sanitation_nn3", "Origin.Tract", "start_lon", "start_lat",
## "Destination.Tract", "end_lon", "end_lat")
## `summarise()` has grouped output by 'interval60'. You can override using the
## `.groups` argument.
## Warning: 1 unknown level in `f`: 0 trips
bike_animation <-
ggplot() +
geom_sf(data = phillyTracts %>%
st_transform(crs=4326))+
geom_sf(data = bike.animation.data,
aes(color = Trips), size = 2) +
scale_color_manual(values = palette4) +
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
labs(title = "Bike Trips by Start Stations for Week 26, 2018",
subtitle = "60-minute intervals: {current_frame}") +
transition_manual(interval60)
animate(bike_animation, duration=50, renderer = gifski_renderer())
I split our data into a training set by 3 weeks and a test set by 2 weeks. I create six linear models or predicting bike rides in Philadelphia.
Model 1: time interval, day of the week, temperature
Model 2: start station, day of the week, temperature
Model 3: start station, time interval, day of the week, temperature, precipitation
Model 4: start station, time interval, day of the week, temperature, precipitation, time lags
Model 5: start station, time interval, day of the week, temperature, precipitation, time lags, holiday lags
Model 6: start station, time interval, day of the week, temperature, precipitation, time lags, holiday lags, amenity features
ride.Train <- filter(ride.panel, week >= 30)
ride.Test <- filter(ride.panel, week < 30)
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data=ride.Train)
reg2 <-
lm(Trip_Count ~ start_station + dotw + Temperature, data=ride.Train)
reg3 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation,
data=ride.Train)
reg4 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day,
data=ride.Train)
reg5 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holiday,
data=ride.Train)
reg6 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holiday +
schools_nn3 + busstops_nn3 + PPR_nn3 + sanitation_nn3,
data=ride.Train)
The predictions using the 5 models above generate the results are shown in the table below.
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred),
FTime_Space_FE_timeLags_amenity = map(.x = data, fit = reg6, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠
## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠
## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠
## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠
## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠
## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠
## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠
## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠
week_predictions
## # A tibble: 24 × 8
## week data Regression Predi…¹ Obser…² Absol…³ MAE sd_AE
## <dbl> <list> <chr> <list> <list> <list> <dbl> <dbl>
## 1 26 <tibble [3,072 × 33]> ATime_FE <dbl> <dbl> <dbl> 0.927 0.815
## 2 27 <tibble [21,376 × 33]> ATime_FE <dbl> <dbl> <dbl> 0.982 1.08
## 3 28 <tibble [21,504 × 33]> ATime_FE <dbl> <dbl> <dbl> 1.05 1.28
## 4 29 <tibble [21,504 × 33]> ATime_FE <dbl> <dbl> <dbl> 1.04 1.24
## 5 26 <tibble [3,072 × 33]> BSpace_FE <dbl> <dbl> <dbl> 1.10 0.767
## 6 27 <tibble [21,376 × 33]> BSpace_FE <dbl> <dbl> <dbl> 1.04 1.08
## 7 28 <tibble [21,504 × 33]> BSpace_FE <dbl> <dbl> <dbl> 1.08 1.26
## 8 29 <tibble [21,504 × 33]> BSpace_FE <dbl> <dbl> <dbl> 1.04 1.24
## 9 26 <tibble [3,072 × 33]> CTime_Space… <dbl> <dbl> <dbl> 0.921 0.818
## 10 27 <tibble [21,376 × 33]> CTime_Space… <dbl> <dbl> <dbl> 0.967 1.08
## # … with 14 more rows, and abbreviated variable names ¹Prediction, ²Observed,
## # ³Absolute_Error
The best models, the lag and amenity models, are accurate to less than an average of 0.8 ride per hour, which is pretty alright for overall accuracy. As the MAE results shown in the bar plot below, model 4, 5, and 6 have the similarly best performance in accurately predicting the bike rides. This shows that time lags and amenity features are really essential for bike share prediction, better than time, space, and weather features. Model 6 is a bit better that Model 4 and 5, which means that amenity features are effective to the prediction.
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette6) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme
The plot below shows the time series of predicted and observed bike trips. The prediction of model 4, 5 and 6 fit well with the observed bike trips.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station)) %>%
dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share time series", subtitle = "Philadelphia; A test set of 2 weeks", x = "Hour", y= "Station Trips") +
plotTheme
The spatial distribution of MAEs is presented below. Higher MAEs concentrate in the center city of Philadelphia. Hence, the prediction is better in the surrounding bike stations. Besides, a relatively worse spatial generalizability in this model due to the difference in spatial distribution of MAE.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon)) %>%
select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "FTime_Space_FE_timeLags_amenity") %>%
group_by(start_station, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE),
fill = "transparent", alpha = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
labs(title="Mean Abs Error, Test Set, Model 6")+
mapTheme
I use the cross validation for 6 models to test the generalizability. I use k-fold cross validation algorithm in this report. Cross-validation ensures that the goodness-of-fit results for a single hold is not by chance.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv1 <-
train(Trip_Count ~ ., data = ride.Train %>%
dplyr::select(Trip_Count,
interval60 , dotw , Temperature),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv1
## Linear Regression
##
## 43008 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 42578, 42578, 42578, 42578, 42578, 42578, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.590366 0.01880733 1.040999
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv1$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 1.483150 0.006809894 1.038448 Fold001
## 2 1.619906 0.005013852 1.040655 Fold002
## 3 1.895653 0.030796279 1.135697 Fold003
## 4 1.793673 0.021019432 1.070515 Fold004
## 5 1.444081 0.007530939 0.998659 Fold005
reg.cv2 <-
train(Trip_Count ~ ., data = ride.Train %>%
dplyr::select(Trip_Count,
start_station , dotw , Temperature),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv2
## Linear Regression
##
## 43008 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 42578, 42578, 42578, 42578, 42578, 42577, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.576475 0.0347273 1.028526
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv2$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 1.393173 0.06217179 0.9655907 Fold001
## 2 2.272312 0.04695982 1.1466700 Fold002
## 3 1.734636 0.03392971 1.1030553 Fold003
## 4 1.418810 0.04592763 1.0170050 Fold004
## 5 1.548500 0.02947415 1.0369431 Fold005
reg.cv3 <-
train(Trip_Count ~ ., data = ride.Train %>%
dplyr::select(Trip_Count,
start_station ,interval60, dotw , Temperature, Precipitation),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv3
## Linear Regression
##
## 43008 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 42578, 42578, 42578, 42578, 42577, 42578, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.576498 0.03681374 1.026565
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv3$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 1.445005 0.03762498 0.9913639 Fold001
## 2 1.366832 0.05581451 0.9642802 Fold002
## 3 1.770422 0.02731340 1.0226374 Fold003
## 4 1.707593 0.04964061 0.9863379 Fold004
## 5 1.278935 0.05396673 0.9271464 Fold005
reg.cv4 <-
train(Trip_Count ~ ., data = ride.Train %>%
dplyr::select(Trip_Count,
start_station ,interval60, dotw, Temperature, Precipitation,
lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv4
## Linear Regression
##
## 43008 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 42578, 42578, 42578, 42577, 42578, 42578, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.304854 0.3409077 0.8152613
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv4$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 1.314264 0.3048313 0.8184091 Fold001
## 2 1.364079 0.3859482 0.8411141 Fold002
## 3 1.264572 0.2101985 0.8295053 Fold003
## 4 1.500862 0.3729488 0.8794841 Fold004
## 5 1.305940 0.4745198 0.8579033 Fold005
reg.cv5 <-
train(Trip_Count ~ ., data = ride.Train %>%
dplyr::select(Trip_Count,
start_station ,interval60, dotw, Temperature, Precipitation,
lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day, holiday),
method = "lm", trControl = fitControl, na.action = na.pass)
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
reg.cv5
## Linear Regression
##
## 43008 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 42578, 42577, 42578, 42578, 42578, 42578, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.303951 0.343506 0.8152864
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv5$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 1.218829 0.4977750 0.7729427 Fold001
## 2 1.407045 0.2565754 0.8717411 Fold002
## 3 1.176218 0.3863221 0.7645609 Fold003
## 4 1.454991 0.2274118 0.8556157 Fold004
## 5 1.347985 0.2536958 0.8756964 Fold005
reg.cv6 <-
train(Trip_Count ~ ., data = ride.Train %>%
dplyr::select(Trip_Count,
start_station ,interval60, dotw, Temperature, Precipitation,
lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day, holiday, schools_nn3, busstops_nn3, sanitation_nn3, PPR_nn3),
method = "lm", trControl = fitControl, na.action = na.pass)
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠
reg.cv6
## Linear Regression
##
## 43008 samples
## 15 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 42578, 42578, 42578, 42578, 42577, 42579, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.304107 0.3424148 0.8137612
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv6$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 1.335843 0.3969327 0.8253428 Fold001
## 2 1.340108 0.3019894 0.8412230 Fold002
## 3 1.217312 0.2789517 0.7766859 Fold003
## 4 1.409614 0.4322938 0.8768652 Fold004
## 5 1.182172 0.3110342 0.8205507 Fold005
The distribution of MAE of model 5 cluster quite tightly together, which mostly lie between 0.77 and 0.82, and this suggest that model 5 have a good generalizability.
MAE5 <- reg.cv5$resample
ggplot(MAE5, aes(x=MAE)) +
geom_histogram(color="black", fill="white") +
xlim(0, 1) +
labs(title = "Histogram of MAE of Cross Validation", subtitle = "Model:ETime_Space_FE_timeLags_holidayLags", x = "MAE", y= "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
The scatter plot of observed and predicted bike trips is presented below. We can see that the red line of predicted results is under the black line of perfect model. This means that there is no significant difference in model errors in time.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw)) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "FTime_Space_FE_timeLags_amenity")%>%
mutate(weekend = ifelse(dotw %in% c("日", "六"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction))+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted",
x="Observed trips",
y="Predicted trips")+
plotTheme
We can see that there is a huge difference in the distribution of MAE during different time. Specifically, on weekdays, there are higher MAE stations in the center city during am , mid-day and pm rush hours, while the MAEs is quite similar in overnight periods. On weekends, higher MAE stations occur in the center city during mid-day and pm rush hours, while the MAEs is quite similar during am rush hours and overnight periods.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "FTime_Space_FE_timeLags_amenity")%>%
mutate(weekend = ifelse(dotw %in% c("日", "六"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE),
fill = "transparent", size = 0.5, alpha = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")+
mapTheme
Lastly, I focus on the morning commute, where station locations probably relate to likely users, who seem to be commuting to center city. There are tends that as median household income or percentage of white goes up, the MAE goes up. Also, as percentage of taking public transit goes up, the MAE goes down.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Med_Inc = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White)) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
unnest() %>%
filter(Regression == "FTime_Space_FE_timeLags_amenity")%>%
mutate(weekend = ifelse(dotw %in% c("日", "六"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start_station, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
#geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
plotTheme
In conclusion, the model 6 can be applied in the bike sharing re-balancing plan to better predict the share bike demand for Indego bike share system in Philadelphia. Model 6 includes features of time interval, day of the week, temperature, precipitation, time lags, holiday lags, and amenity features. First, it has an acceptable accuracy by approximately 0.8 of mean MAE, which means the error of predicting bike share demands is around 0.8 bike. It is acceptable because less than one bike demand might be predicted wrong in each station, which doesn’t impact a lot. Besides, the model 6 shows a quite good generalizability showed in cross validation. A good generalizability shows that it is quite reliable for us to use the model to predict the bike demand in the future uses. Lastly, the model 6 shows a great fit of observed data and predicted data showed in the plot presented above. Therefore, Model 6 can be usefull in predicting bike share demand for the future two weeks and help INdego to optimize and set up re-balancing bikes plans in advance.
There are still some limitation for the model. It was showed that the spatial distribution of the MAEs is quit uneven, which will make predictions in some areas not that accurate, specifically in Center City, Philadelphia. The model still can be improve by better its MAE’s spatial distribution and the accuracy since the riders can ride across areas and should have a similar experience in each station. Moreover, more features can be added to better our model since the amenity features for now can only improve slightly on the model.