Space-Time Prediction of Bike Share Demand in Philadelphia

1.1. Introduction

Bike share has become more and more popular in big cities since it is convenient for us to rent bike for short commuting from point to point. However, it is crucial for share bikes stations to balance their bike demand. It is always a issue when either a share bike station is empty or it is no vacant docking to deposit a bike. Whenever these kinds of cases happen, both the bike riders and the company may suffer a lot. Therefore, it is important to come up with a re-balancing strategy to better allocate the share bike distribution and the prediction of a share bike demand either spatially or temporally. In this way, we can better predict the demand and distribute the share bike strategically and ensure the users to have a great experience exploring the city by using our convenient and smart share bike system.

In this report, I built a prediction model for bike sharing demand based on Philadelphia’s Indego bike share system. The model has a moderate accuracy and generalizability to predict the demand temporally and spatially, which can further utilize in the re-balancing plans for Indego to predict each stations demand two weeks earlier and make plans for redistributing the bikes from station to station. Also, I considered factors includes time lags, holiday lags(Independence day for instance), and amenity features, such as bus stops, schools, parks, and sanitation centers) to better our model.

Furthermore, I suggest that share bike company can redistribute the bikes by a small fleets of trucks to move bikes from station to station. However, share bike is a business model in circular economy, so I suggests that companies can reward user for helping allocate the bikes so that we can reduce our uses of trucks to move around the bikes. By doing so, the carbon emission can also be minimized.

2.1. Setup

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus) 
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(pillar)
library(dplyr)
library(FNN)
library(ggplot2)
library(gganimate)
library(gapminder)
library(gifski)
library(caret)

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")


plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
palette6 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c","909580")
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

# Install Census API Key
tidycensus::census_api_key("16325a5baedf1c986a182c3321f90022be22845a", overwrite = TRUE, install = TRUE)

2.2. Data Wrangling

I read in the indego trip data of July, 2018 using read.socrata. Also, I filter out the station id of 3000, which means unknown station. The weather in Philadelphia in July has a pleasant temperature range, so we may see some leisure trips as well as commutes.

indego <- read.csv("/Users/teresachang/Downloads/indego-trips-2018-q3.csv")
indego <- 
  filter(indego,indego$end_station != 3000)

I use date parsing to bin the data by 15 and 60 minute intervals by rounding. I extract the week of the observation, ranging from 1-52 throughout the year, and the dotw for day of the week. And I only keep the data of the week 26 to 31 in this report.

indego <- 
  indego %>%
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))
indego <- 
  indego %>%
  filter(.,indego$week < 32)

#glimpse(indego)

2.3. Import Census and Amenity Info

Using the tidycensus package, we can download census geography and variables. I extract the tracts for mapping and joining purposes.

phillyCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2018, 
          state = "PA", 
          geometry = TRUE, 
          county=c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

phillyTracts <- 
  phillyCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

I add amenity features of Septa bus stops, schools, park and recreation center, and sanitation convenience centers obtained from OpenDataPhilly https://www.opendataphilly.org/dataset. Then, I use the method of Nearest neighbor to add the features for Septa bus stops, schools, park and recreation center, and sanitation convenience centers as predictors to be put into the regression models.

busstops <- read_csv("/Users/teresachang/Downloads/shelter_locations.csv")%>%
          st_as_sf(., coords = c("LONG.", "LAT."), crs = 4326)

## Rows: 297 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): STREET, INTERSECTI, ADDRESS
## dbl (2): LONG., LAT.
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

schools <- st_read("https://opendata.arcgis.com/datasets/d46a7e59e2c246c891fbee778759717e_0.geojson") %>%
          st_set_crs('4326')

## Reading layer `Schools' from data source 
##   `https://opendata.arcgis.com/datasets/d46a7e59e2c246c891fbee778759717e_0.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 490 features and 15 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.26633 ymin: 39.90781 xmax: -74.97053 ymax: 40.12974
## Geodetic CRS:  WGS 84

PPR <- st_read("/Users/teresachang/Downloads/PPR_Program_Sites.geojson")%>%
          st_set_crs('4326')

## Reading layer `PPR_Program_Sites' from data source 
##   `/Users/teresachang/Downloads/PPR_Program_Sites.geojson' using driver `GeoJSON'
## Simple feature collection with 224 features and 10 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.26015 ymin: 39.89718 xmax: -74.96944 ymax: 40.12284
## Geodetic CRS:  WGS 84

sanitation <- st_read("https://opendata.arcgis.com/datasets/129e346cc7654079980860b0656587a5_0.geojson") %>%
          st_set_crs('4326')

## Reading layer `Sanitation_Convenience_Centers' from data source 
##   `https://opendata.arcgis.com/datasets/129e346cc7654079980860b0656587a5_0.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 6 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.24008 ymin: 39.91952 xmax: -75.01131 ymax: 40.03858
## Geodetic CRS:  WGS 84

startstop <-
  indego %>%
          filter(is.na(start_lon) == FALSE &
                 is.na(start_lat) == FALSE &
                 is.na(end_lon) == FALSE &
                 is.na(end_lat) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)


indego <-
  indego %>%
  filter(is.na(start_lon) == FALSE &
         is.na(start_lat) == FALSE &
         is.na(end_lon) == FALSE &
         is.na(end_lat) == FALSE) %>%
  mutate(
      schools_nn3 = nn_function(na.omit(st_coordinates(startstop)), na.omit(st_coordinates(schools)), 3),
      busstops_nn3 = nn_function(na.omit(st_coordinates(startstop)), na.omit(st_coordinates(busstops)), 3),
      PPR_nn3 = nn_function(na.omit(st_coordinates(startstop)), na.omit(st_coordinates(PPR)), 3),
      sanitation_nn3 = nn_function(na.omit(st_coordinates(startstop)), na.omit(st_coordinates(sanitation)), 3)
      )

I add the spatial information to our bike share data as origin and destination data, first joining the origin station, then the destination station to our census data. However, I don’t use the destination data in this report.

bike_census <- st_join(indego %>% 
          filter(is.na(start_lon) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lon) == FALSE &
                   is.na(end_lat) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
        phillyTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., phillyTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)

2.4. Import Weather Data

Import weather data from Philadelphia International Airport (code KPHL) using riem_measuresfrom 2018/7/1 to 2018/8/5, and get temperature, wind speed, precipitation on an hourly basis and plot the temperature and precipitation trends over our study period. I assume that while raining days and high temperature days, people tend to not ride bikes.

weather.Panel <- 
  riem_measures(station = "KPHL", date_start = "2018-07-01", date_end = "2018-08-05") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

#glimpse(weather.Panel)

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - Philadelphia KPHL - 7/1 to 8/5, 2018")

3.1. Explore Analysis

I examine the time and frequency components of our data. First, we look at the overall time pattern. Note that it is Independence Day on 7/4, which means that holidays may influence the normal pattern in weekday.

ggplot(bike_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. Philadelphia, 7/1 to 8/5, 2018",
       x="Date", 
       y="Number of trips")+
  plotTheme

I examine the distribution of trip volume by station for different times of the day. The mean number of trips each station shows different patterns during am rush hours, pm rush hours, mid-day, and overnight. It is obvious that mid-day has the highest frequency in 2 trips overall. Also, PM ride has the largest number of rides by 8, which means people are more likely to ride a bike in the pm rush hours.

bike_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. Philadelphia, 7/1 to 8/5, 2018",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

Histogram of bike share trips per hour by station is made. We can see that sost stations have less than 20 bike rides per hour.

ggplot(bike_census %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5)+
  labs(title="Bike share trips per hr by station. Philadelphia, 7/1 to 8/5, 2018",
       x="Trip Counts", 
       y="Number of Stations")+
  plotTheme

The plot of bike share trips distribution with time, sorted by weekdays and weekends is made. It can be observe that the peaks occur in am and pm rush hours in weekdays. However, it shows only one peak in weekends, which is around 3 pm when many people might go out in the afternoon for some activities.

ggplot(bike_census %>% mutate(hour = hour(start_time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in Philadelphia, by day of the week, 7/1 to 8/5, 2018",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

ggplot(bike_census %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("日", "六"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in Philadelphia - weekend vs weekday, 7/1 to 8/5, 2018",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

The origin map for the bike trips in Philadelphia is made below. The maps are sorted by weekdays/weekends and different time categories of the day. It is obvious that more trips happened in weekdays than weekends. Also, there are more trips in pm rush hours. Besides, trips mostly happens in the center city of Philadelphia.

ggplot()+
  geom_sf(data = phillyTracts %>%
          st_transform(crs=4326))+
  geom_point(data = bike_census %>% 
            mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("日", "六"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
              tally(),
            aes(x=start_lon, y = start_lat, color = n), 
            fill = "transparent", alpha = 0.4, size = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. Philadelphia, 7/1 to 8/5, 2018")+
  mapTheme

3.2 Create Space-Time Panel

A space-time panel is created for future regression model. First, I make sure each unique station and hour/day combo exists in our data set, which is done in order to create a panel data set where each time period in the study is represented by a row. If a station didn’t have any trips originating from it at a given hour, I add a zero in that spot in the panel. Besides, all other features needed for model training are also involved in the space-time panel.

#length(unique(bike_census$interval60)) * length(unique(bike_census$start_station))


study.panel <- 
  expand.grid(interval60=unique(bike_census$interval60), 
              start_station = unique(bike_census$start_station)) %>%
  left_join(., bike_census %>%
              select(start_station, Origin.Tract, start_lon, start_lat, busstops_nn3, schools_nn3, PPR_nn3,       
                     sanitation_nn3)%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

#nrow(study.panel)

ride.panel <- 
  bike_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat, busstops_nn3, schools_nn3, PPR_nn3,       
                     sanitation_nn3) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, phillyCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

3.3. Create time lags

I add additional nuance about the demand during a given time period by creating time lag variables. Time lags of 1 hour, 2 hours, 3 hours, 4 hours, 12 hours, and 1 day are created in the panel. Also, I try to control for the effects of holidays that disrupt the expected demand during a given weekend or weekday. There is a holiday on July 4th, Independence Day. For that five day weekend I use some variables indicating temporal proximity to the holiday.

ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 185,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))

The table provided below is the correlations in these lags. We can see that 1 hour lag and 1 day lag have the top 2 largest positive correlation with the bike share data while the 12 hour lag has the largest negative correlation. This means the bike share demand has a high similarity with the near future and the next day. Also, it is totally different in the same station after 12 hours since people user scenario and behavior may change a lot after 12 hours.

as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day","holidayLag")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))

## # A tibble: 6 × 2
##   Variable   correlation
##   <fct>            <dbl>
## 1 lagHour           0.81
## 2 lag2Hours         0.55
## 3 lag3Hours         0.33
## 4 lag4Hours         0.14
## 5 lag12Hours       -0.35
## 6 lag1day           0.76

3.4 Animation Map

To better illustrate the change of numbers of bike trips each station with time, an animated map is generated. Trip count is of the 60-minute intervals, and a whole week of week 27 is put into the map.

stations<-
  indego %>% 
  dplyr::select(start_station, start_lon, start_lat) %>%
  group_by(start_station) %>%
  distinct()%>%
  filter(is.na(start_lon) == FALSE &
           is.na(start_lat) == FALSE) %>%
  st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)

week26 <-
  filter(bike_census , week == 27)

week26.panel <-
  expand.grid(
    interval60 = unique(week26$interval60),
    Origin.Tract = unique(ride.panel$Origin.Tract))

bike.animation.data <-
  mutate(week26, Trip_Counter = 1) %>%
  right_join(week26) %>% 
  group_by(interval60, start_station) %>%
  summarise(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  left_join(stations, by=c("start_station" = "start_station")) %>%
  st_sf() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count == 1 ~ "1 trips",
                           Trip_Count == 2 ~ "2 trips",
                           Trip_Count > 2 & Trip_Count <= 4 ~ "3-4 trips",
                           Trip_Count > 4 ~ "5+ trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1 trips","2 trips",
                              "3-4 trips","5+ trips"))

## Joining, by = c("trip_id", "duration", "start_time", "end_time",
## "start_station", "end_station", "bike_id", "plan_duration",
## "trip_route_category", "passholder_type", "bike_type", "interval60",
## "interval15", "week", "dotw", "schools_nn3", "busstops_nn3", "PPR_nn3",
## "sanitation_nn3", "Origin.Tract", "start_lon", "start_lat",
## "Destination.Tract", "end_lon", "end_lat")
## `summarise()` has grouped output by 'interval60'. You can override using the
## `.groups` argument.

## Warning: 1 unknown level in `f`: 0 trips

bike_animation <-
  ggplot() +
  geom_sf(data = phillyTracts %>%
            st_transform(crs=4326))+
  geom_sf(data = bike.animation.data, 
             aes(color = Trips), size = 2) +
  scale_color_manual(values = palette4) +
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
  labs(title = "Bike Trips by Start Stations for Week 26, 2018",
       subtitle = "60-minute intervals: {current_frame}") +
  transition_manual(interval60)

animate(bike_animation, duration=50, renderer = gifski_renderer())

4.1. Run Models

I split our data into a training set by 3 weeks and a test set by 2 weeks. I create six linear models or predicting bike rides in Philadelphia.

Model 1: time interval, day of the week, temperature

Model 2: start station, day of the week, temperature

Model 3: start station, time interval, day of the week, temperature, precipitation

Model 4: start station, time interval, day of the week, temperature, precipitation, time lags

Model 5: start station, time interval, day of the week, temperature, precipitation, time lags, holiday lags

Model 6: start station, time interval, day of the week, temperature, precipitation, time lags, holiday lags, amenity features

ride.Train <- filter(ride.panel, week >= 30)
ride.Test <- filter(ride.panel, week < 30)

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day  + holiday, 
     data=ride.Train)

reg6 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day  + holiday + 
                   schools_nn3 + busstops_nn3 + PPR_nn3 + sanitation_nn3, 
     data=ride.Train)

4.2. Predict for test data

The predictions using the 5 models above generate the results are shown in the table below.

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week)

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred),
           FTime_Space_FE_timeLags_amenity = map(.x = data, fit = reg6, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠

## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠

## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠

## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠

## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠

## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠

## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠

## Warning in predict.lm(fit, newdata = dat): 用秩缺乏擬合來進行預測的結果很可能不
## 可靠

week_predictions

## # A tibble: 24 × 8
##     week data                   Regression   Predi…¹ Obser…² Absol…³   MAE sd_AE
##    <dbl> <list>                 <chr>        <list>  <list>  <list>  <dbl> <dbl>
##  1    26 <tibble [3,072 × 33]>  ATime_FE     <dbl>   <dbl>   <dbl>   0.927 0.815
##  2    27 <tibble [21,376 × 33]> ATime_FE     <dbl>   <dbl>   <dbl>   0.982 1.08 
##  3    28 <tibble [21,504 × 33]> ATime_FE     <dbl>   <dbl>   <dbl>   1.05  1.28 
##  4    29 <tibble [21,504 × 33]> ATime_FE     <dbl>   <dbl>   <dbl>   1.04  1.24 
##  5    26 <tibble [3,072 × 33]>  BSpace_FE    <dbl>   <dbl>   <dbl>   1.10  0.767
##  6    27 <tibble [21,376 × 33]> BSpace_FE    <dbl>   <dbl>   <dbl>   1.04  1.08 
##  7    28 <tibble [21,504 × 33]> BSpace_FE    <dbl>   <dbl>   <dbl>   1.08  1.26 
##  8    29 <tibble [21,504 × 33]> BSpace_FE    <dbl>   <dbl>   <dbl>   1.04  1.24 
##  9    26 <tibble [3,072 × 33]>  CTime_Space… <dbl>   <dbl>   <dbl>   0.921 0.818
## 10    27 <tibble [21,376 × 33]> CTime_Space… <dbl>   <dbl>   <dbl>   0.967 1.08 
## # … with 14 more rows, and abbreviated variable names ¹Prediction, ²Observed,
## #   ³Absolute_Error

5.1. Examine Error Metrics for Accuracy

The best models, the lag and amenity models, are accurate to less than an average of 0.8 ride per hour, which is pretty alright for overall accuracy. As the MAE results shown in the bar plot below, model 4, 5, and 6 have the similarly best performance in accurately predicting the bike rides. This shows that time lags and amenity features are really essential for bike share prediction, better than time, space, and weather features. Model 6 is a bit better that Model 4 and 5, which means that amenity features are effective to the prediction.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette6) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme

The plot below shows the time series of predicted and observed bike trips. The prediction of model 4, 5 and 6 fit well with the observed bike trips.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "Philadelphia; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme

The spatial distribution of MAEs is presented below. Higher MAEs concentrate in the center city of Philadelphia. Hence, the prediction is better in the surrounding bike stations. Besides, a relatively worse spatial generalizability in this model due to the difference in spatial distribution of MAE.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon)) %>%
    select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_amenity") %>%
  group_by(start_station, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
  labs(title="Mean Abs Error, Test Set, Model 6")+
  mapTheme

5.2. Cross Validation

I use the cross validation for 6 models to test the generalizability. I use k-fold cross validation algorithm in this report. Cross-validation ensures that the goodness-of-fit results for a single hold is not by chance.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv1 <- 
  train(Trip_Count ~ ., data = ride.Train %>% 
          dplyr::select(Trip_Count, 
                        interval60 , dotw ,  Temperature), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv1

## Linear Regression 
## 
## 43008 samples
##     3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 42578, 42578, 42578, 42578, 42578, 42578, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   1.590366  0.01880733  1.040999
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

reg.cv1$resample[1:5,]

##       RMSE    Rsquared      MAE Resample
## 1 1.483150 0.006809894 1.038448  Fold001
## 2 1.619906 0.005013852 1.040655  Fold002
## 3 1.895653 0.030796279 1.135697  Fold003
## 4 1.793673 0.021019432 1.070515  Fold004
## 5 1.444081 0.007530939 0.998659  Fold005

reg.cv2 <- 
  train(Trip_Count ~ ., data = ride.Train %>% 
          dplyr::select(Trip_Count, 
                        start_station , dotw ,  Temperature), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv2

## Linear Regression 
## 
## 43008 samples
##     3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 42578, 42578, 42578, 42578, 42578, 42577, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.576475  0.0347273  1.028526
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

reg.cv2$resample[1:5,]

##       RMSE   Rsquared       MAE Resample
## 1 1.393173 0.06217179 0.9655907  Fold001
## 2 2.272312 0.04695982 1.1466700  Fold002
## 3 1.734636 0.03392971 1.1030553  Fold003
## 4 1.418810 0.04592763 1.0170050  Fold004
## 5 1.548500 0.02947415 1.0369431  Fold005

reg.cv3 <- 
  train(Trip_Count ~ ., data = ride.Train %>% 
          dplyr::select(Trip_Count, 
                        start_station ,interval60, dotw ,  Temperature, Precipitation), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv3

## Linear Regression 
## 
## 43008 samples
##     5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 42578, 42578, 42578, 42578, 42577, 42578, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   1.576498  0.03681374  1.026565
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

reg.cv3$resample[1:5,]

##       RMSE   Rsquared       MAE Resample
## 1 1.445005 0.03762498 0.9913639  Fold001
## 2 1.366832 0.05581451 0.9642802  Fold002
## 3 1.770422 0.02731340 1.0226374  Fold003
## 4 1.707593 0.04964061 0.9863379  Fold004
## 5 1.278935 0.05396673 0.9271464  Fold005

reg.cv4 <- 
  train(Trip_Count ~ ., data = ride.Train %>% 
          dplyr::select(Trip_Count, 
                        start_station ,interval60, dotw, Temperature, Precipitation, 
                        lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv4

## Linear Regression 
## 
## 43008 samples
##    10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 42578, 42578, 42578, 42577, 42578, 42578, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.304854  0.3409077  0.8152613
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

reg.cv4$resample[1:5,]

##       RMSE  Rsquared       MAE Resample
## 1 1.314264 0.3048313 0.8184091  Fold001
## 2 1.364079 0.3859482 0.8411141  Fold002
## 3 1.264572 0.2101985 0.8295053  Fold003
## 4 1.500862 0.3729488 0.8794841  Fold004
## 5 1.305940 0.4745198 0.8579033  Fold005

reg.cv5 <- 
  train(Trip_Count ~ ., data = ride.Train %>% 
          dplyr::select(Trip_Count, 
                        start_station ,interval60, dotw, Temperature, Precipitation, 
                        lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day, holiday), 
        method = "lm", trControl = fitControl, na.action = na.pass)

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

reg.cv5

## Linear Regression 
## 
## 43008 samples
##    11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 42578, 42577, 42578, 42578, 42578, 42578, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE      
##   1.303951  0.343506  0.8152864
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

reg.cv5$resample[1:5,]

##       RMSE  Rsquared       MAE Resample
## 1 1.218829 0.4977750 0.7729427  Fold001
## 2 1.407045 0.2565754 0.8717411  Fold002
## 3 1.176218 0.3863221 0.7645609  Fold003
## 4 1.454991 0.2274118 0.8556157  Fold004
## 5 1.347985 0.2536958 0.8756964  Fold005

reg.cv6 <- 
  train(Trip_Count ~ ., data = ride.Train %>% 
          dplyr::select(Trip_Count, 
                        start_station ,interval60, dotw, Temperature, Precipitation, 
                        lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day, holiday, schools_nn3, busstops_nn3, sanitation_nn3, PPR_nn3), 
        method = "lm", trControl = fitControl, na.action = na.pass)

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

## Warning in predict.lm(modelFit, newdata): 用秩缺乏擬合來進行預測的結果很可能不可
## 靠

reg.cv6

## Linear Regression 
## 
## 43008 samples
##    15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 42578, 42578, 42578, 42578, 42577, 42579, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.304107  0.3424148  0.8137612
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

reg.cv6$resample[1:5,]

##       RMSE  Rsquared       MAE Resample
## 1 1.335843 0.3969327 0.8253428  Fold001
## 2 1.340108 0.3019894 0.8412230  Fold002
## 3 1.217312 0.2789517 0.7766859  Fold003
## 4 1.409614 0.4322938 0.8768652  Fold004
## 5 1.182172 0.3110342 0.8205507  Fold005

The distribution of MAE of model 5 cluster quite tightly together, which mostly lie between 0.77 and 0.82, and this suggest that model 5 have a good generalizability.

MAE5 <- reg.cv5$resample
ggplot(MAE5, aes(x=MAE)) + 
  geom_histogram(color="black", fill="white") +
  xlim(0, 1) +
  labs(title = "Histogram of MAE of Cross Validation", subtitle = "Model:ETime_Space_FE_timeLags_holidayLags",  x = "MAE", y= "Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing missing values (geom_bar).

5.3. Space-Time Error Evaluation

The scatter plot of observed and predicted bike trips is presented below. We can see that the red line of predicted results is under the black line of perfect model. This means that there is no significant difference in model errors in time.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_amenity")%>%
  mutate(weekend = ifelse(dotw %in% c("日", "六"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

We can see that there is a huge difference in the distribution of MAE during different time. Specifically, on weekdays, there are higher MAE stations in the center city during am , mid-day and pm rush hours, while the MAEs is quite similar in overnight periods. On weekends, higher MAE stations occur in the center city during mid-day and pm rush hours, while the MAEs is quite similar during am rush hours and overnight periods.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_amenity")%>%
  mutate(weekend = ifelse(dotw %in% c("日", "六"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme

Lastly, I focus on the morning commute, where station locations probably relate to likely users, who seem to be commuting to center city. There are tends that as median household income or percentage of white goes up, the MAE goes up. Also, as percentage of taking public transit goes up, the MAE goes down.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_amenity")%>%
  mutate(weekend = ifelse(dotw %in% c("日", "六"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

6.1. Conclusions

In conclusion, the model 6 can be applied in the bike sharing re-balancing plan to better predict the share bike demand for Indego bike share system in Philadelphia. Model 6 includes features of time interval, day of the week, temperature, precipitation, time lags, holiday lags, and amenity features. First, it has an acceptable accuracy by approximately 0.8 of mean MAE, which means the error of predicting bike share demands is around 0.8 bike. It is acceptable because less than one bike demand might be predicted wrong in each station, which doesn’t impact a lot. Besides, the model 6 shows a quite good generalizability showed in cross validation. A good generalizability shows that it is quite reliable for us to use the model to predict the bike demand in the future uses. Lastly, the model 6 shows a great fit of observed data and predicted data showed in the plot presented above. Therefore, Model 6 can be usefull in predicting bike share demand for the future two weeks and help INdego to optimize and set up re-balancing bikes plans in advance.

There are still some limitation for the model. It was showed that the spatial distribution of the MAEs is quit uneven, which will make predictions in some areas not that accurate, specifically in Center City, Philadelphia. The model still can be improve by better its MAE’s spatial distribution and the accuracy since the riders can ride across areas and should have a similar experience in each station. Moreover, more features can be added to better our model since the amenity features for now can only improve slightly on the model.