import folium
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
import seaborn as sns
import matplotlib.pyplot as plt
Importing trip data for March 2016 and Weather data from Weather Underground.
The source data is read in a separate Python class, manipulated as a DataFrame, and finally exported to Parquet (or Feather) format. Both Parquet and Feather are well supported, but Parquet has wider adoption at the moment. I also noticed that Parquet files compress better with Snappy. To learn more about Feather, read https://blog.rstudio.org/2016/03/29/feather/
The ETL part of the analysis lives in a separate file (CitiBike_ETL.py). It reads all the .csv files and converts them into Parquet. We could also have partitioned the Parquet files on the column we'll later join with other datasets, but since this analysis covers just one month of data, I have not. This notebook will be run multiple times during the analysis, so importing a pre-built DataFrame is more efficient.
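The CSV-to-Parquet step can be sketched roughly as follows. This is a simplified stand-in for CitiBike_ETL.py, not its actual contents; the function name and paths are hypothetical:

```python
import glob
import pandas as pd

def csvs_to_parquet(csv_glob, out_path):
    """Read every trip CSV matching csv_glob and write one Parquet file."""
    frames = [pd.read_csv(path) for path in sorted(glob.glob(csv_glob))]
    combined = pd.concat(frames, ignore_index=True)
    # Snappy is pandas' default Parquet codec; spelled out here for clarity
    combined.to_parquet(out_path, compression="snappy")
    return combined
```

A call like `csvs_to_parquet("./data/201603-*.csv", "./data/bikedata_201603.parquet")` would then produce the file read below. Writing Parquet requires a pyarrow or fastparquet engine to be installed.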
# Read the data files
bikedata = pd.read_parquet("./data/bikedata_201603.parquet")
weather = pd.read_csv('./data/weather.csv')
Drop the unnamed index column carried over from the CSV export
weather.drop('Unnamed: 0', axis=1, inplace=True)
Data manipulation
Convert the join keys in both DataFrames to the same type (<type 'datetime.date'>)
bikedata['dtstarttime'] = pd.to_datetime(bikedata.starttime, format="%m/%d/%Y %H:%M:%S")
bikedata['dtstoptime'] = pd.to_datetime(bikedata.stoptime, format="%m/%d/%Y %H:%M:%S")
bikedata['date'] = bikedata.dtstarttime.dt.date
weather['date'] = pd.to_datetime(weather.date, format="%Y-%m-%d").dt.date
Set the start and stop timestamps with minutes and seconds reset to 0 (e.g. 2016-03-01 06:00:00), so that departures and arrivals can be aggregated per hour to identify bike usage
bikedata['dtstartdatehour'] = bikedata.dtstarttime.dt.floor('h')
bikedata['dtstopdatehour'] = bikedata.dtstoptime.dt.floor('h')
Display the data types of the data frame
bikedata.dtypes
tripduration int64
starttime object
stoptime object
start station id int64
start station name object
start station latitude float64
start station longitude float64
end station id int64
end station name object
end station latitude float64
end station longitude float64
bikeid int64
usertype object
birth year float64
male int64
female int64
dtstartdatehour datetime64[ns]
dtstarttime datetime64[ns]
dtstoptime datetime64[ns]
date object
dtstopdatehour datetime64[ns]
dtype: object
As the graph below shows, the busiest times in the month of March are 8-10 am and 5-6 pm. It also seems that people prefer taking bikes in the evening after work
aggByHour = bikedata.groupby(by=bikedata.dtstartdatehour.dt.hour).count()['start station id']
sns.barplot(x=aggByHour.index, y=aggByHour.values)
As you can see below, the busiest start and end stations are almost the same ones. We can deduce a few things from this.
- The bike company could employ more staff in these stations
- They could keep these stations full in the mornings and evenings
aggByStartSt = bikedata.groupby(by=['start station name']).count()['start station id'].sort_values(ascending=False).head(10)
aggByEndSt = bikedata.groupby(by=['end station name']).count()['end station id'].sort_values(ascending=False).head(10)
f, axes = plt.subplots(1,2,figsize=(12, 6))
g1 = sns.barplot(x=aggByStartSt.index, y=aggByStartSt.values, ax=axes[0])\
    .set_xticklabels(aggByStartSt.index, rotation=90)
g2 = sns.barplot(x=aggByEndSt.index, y=aggByEndSt.values, ax=axes[1])\
    .set_xticklabels(aggByEndSt.index, rotation=90)
# Merging weather and bike data
df = pd.merge(bikedata, weather, on='date', how='left')
df.dtypes
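A left join silently produces NaNs for any trip date missing from the weather table, and pandas' `indicator` flag makes such gaps easy to spot. A self-contained sketch with toy frames (the column `meantempm` here is just a placeholder, not necessarily a column in the real weather file):

```python
import pandas as pd

# Toy stand-ins for bikedata and weather
trips = pd.DataFrame({"date": ["2016-03-01", "2016-03-02", "2016-03-03"]})
weather = pd.DataFrame({"date": ["2016-03-01", "2016-03-02"],
                        "meantempm": [41, 38]})

merged = pd.merge(trips, weather, on="date", how="left", indicator=True)
# Rows tagged "left_only" are trip dates with no matching weather row
unmatched = merged[merged["_merge"] == "left_only"]
```

Checking that `unmatched` is empty after the real merge confirms the weather file covers every day of trip data.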
We can see that only 1.12% of the trips start and end at the same station
float(len(bikedata[bikedata['start station id'] == bikedata['end station id']]))*100/len(bikedata)
1.1209185591134498
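The same figure can be computed without the explicit float/len arithmetic: comparing the two columns gives a boolean Series, and its mean is the fraction of round trips. A toy example (station ids invented for illustration):

```python
import pandas as pd

# Toy frame: one of four trips starts and ends at the same station
df = pd.DataFrame({"start station id": [72, 79, 82, 83],
                   "end station id":   [72, 116, 119, 120]})

# mean() of a boolean Series is the share of True values
round_trip_pct = (df["start station id"] == df["end station id"]).mean() * 100
```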
More data manipulation
Extract the Hour component from starttime and stoptime
#startHour = bikedata.starttime.str.split().str[1].str.split(':').str[0].astype('int')
#endHour = bikedata.stoptime.str.split().str[1].str.split(':').str[0].astype('int')
startHour = bikedata.dtstarttime.dt.hour
endHour = bikedata.dtstoptime.dt.hour
# All data where starttime is before 9 am
# dfStopBeforeTen = bikedata[endHour <= 9]
# All data where starttime is after 4 pm
# dfStartAfterFour = bikedata[startHour >= 16]
Getting all data for mornings (before 10 am) and evenings (after 4 pm)
dfMorning = bikedata[endHour <= 10]
dfEvening = bikedata[startHour >= 16]
dfMorningStart = dfMorning.groupby(by=['start station id'])['bikeid'].count().sort_values(ascending=False)
dfMorningEnd = dfMorning.groupby(by=['end station id'])['bikeid'].count().sort_values(ascending=False)
dfEveningStart = dfEvening.groupby(by=['start station id'])['bikeid'].count().sort_values(ascending=False)
dfEveningEnd = dfEvening.groupby(by=['end station id'])['bikeid'].count().sort_values(ascending=False)
topN = 750
# Find the start and end stations with more than topN trips before 10 am
dfTopMorningStart = dfMorningStart[dfMorningStart > topN]
dfTopMorningEnd = dfMorningEnd[dfMorningEnd > topN]
# Find the start and end stations with more than topN trips after 4 pm
dfTopEveningStart = dfEveningStart[dfEveningStart > topN]
dfTopEveningEnd = dfEveningEnd[dfEveningEnd > topN]
netMorningSource = bikedata[bikedata['start station id'].isin(dfTopMorningStart.index.values)]
netMorningSink = bikedata[bikedata['end station id'].isin(dfTopMorningEnd.index.values)]
netEveningSource = bikedata[bikedata['start station id'].isin(dfTopEveningStart.index.values)]
netEveningSink = bikedata[bikedata['end station id'].isin(dfTopEveningEnd.index.values)]
We can see that the net sinks (red) are concentrated in the middle of Manhattan, where most of the offices are
netMorningSourcePoints = netMorningSource[['start station latitude', 'start station longitude']]
netMorningSinkPoints = netMorningSink[['end station latitude', 'end station longitude']]
netMorningSourcePointsToPlot = netMorningSourcePoints.drop_duplicates()
netMorningSinkPointsToPlot = netMorningSinkPoints.drop_duplicates()
nyMapMorning = folium.Map(location=[40.7230, -73.9712], zoom_start=12)
for index, row in netMorningSourcePointsToPlot.iterrows():
    folium.CircleMarker([row['start station latitude'], row['start station longitude']], radius=3, color='green').add_to(nyMapMorning)
for index, row in netMorningSinkPointsToPlot.iterrows():
    folium.CircleMarker([row['end station latitude'], row['end station longitude']], radius=3, color='red').add_to(nyMapMorning)
nyMapMorning
netEveningSourcePoints = netEveningSource[['start station latitude', 'start station longitude']]
netEveningSinkPoints = netEveningSink[['end station latitude', 'end station longitude']]
netEveningSourcePointsToPlot = netEveningSourcePoints.drop_duplicates()
netEveningSinkPointsToPlot = netEveningSinkPoints.drop_duplicates()
nyMapEvening = folium.Map(location=[40.7230, -73.9712], zoom_start=12)
for index, row in netEveningSourcePointsToPlot.iterrows():
    folium.CircleMarker([row['start station latitude'], row['start station longitude']], radius=3, color='green').add_to(nyMapEvening)
for index, row in netEveningSinkPointsToPlot.iterrows():
    folium.CircleMarker([row['end station latitude'], row['end station longitude']], radius=3, color='red').add_to(nyMapEvening)
nyMapEvening