import folium
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
import seaborn as sns
import matplotlib.pyplot as plt
Importing trip data for March 2016 and Weather data from Weather Underground.
The source data is read in a separate Python class, manipulated as a DataFrame, and finally exported to Parquet (or Feather) format. Both Parquet and Feather are well supported, but Parquet has wider adoption at the moment. I also noticed that Parquet files compress better with Snappy. To learn more about Feather, read https://blog.rstudio.org/2016/03/29/feather/
The ETL part of the analysis lives in a separate file (CitiBike_ETL.py). It reads all the .csv files and converts them into Parquet. We could also have partitioned the Parquet files on the column we'll later join with other datasets, but since this analysis covers just one month of data, I have not. This notebook will be run multiple times during the analysis, so importing a pre-built DataFrame is more efficient.
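The CSV-to-Parquet step can be sketched roughly as follows. This is a simplified stand-in for CitiBike_ETL.py, not its actual contents; the function name and paths are hypothetical:

```python
import glob
import pandas as pd

def csvs_to_parquet(csv_glob, out_path):
    """Read every trip CSV matching csv_glob and write one Parquet file."""
    frames = [pd.read_csv(path) for path in sorted(glob.glob(csv_glob))]
    combined = pd.concat(frames, ignore_index=True)
    # Snappy is pandas' default Parquet codec; spelled out here for clarity
    combined.to_parquet(out_path, compression="snappy")
    return combined
```

A call like `csvs_to_parquet("./data/201603-*.csv", "./data/bikedata_201603.parquet")` would then produce the file read below. Writing Parquet requires a pyarrow or fastparquet engine to be installed.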
# Read the data files
bikedata = pd.read_parquet("./data/bikedata_201603.parquet")
weather = pd.read_csv('./data/weather.csv')
Drop the unnamed index column carried over from the CSV export
weather.drop('Unnamed: 0', axis=1, inplace=True)
Data manipulation
Convert the join keys in both DataFrames to the same type (<type 'datetime.date'>)
bikedata['dtstarttime'] = pd.to_datetime(bikedata.starttime, format="%m/%d/%Y %H:%M:%S")
bikedata['dtstoptime'] = pd.to_datetime(bikedata.stoptime, format="%m/%d/%Y %H:%M:%S")
bikedata['date'] = bikedata.dtstarttime.dt.date
weather['date'] = pd.to_datetime(weather.date, format="%Y-%m-%d").dt.date
Set the start and stop timestamps with minutes and seconds reset to 0 (e.g. 2016-03-01 06:00:00), so that departures and arrivals can be aggregated per hour to identify bike usage
bikedata['dtstartdatehour'] = bikedata.dtstarttime.dt.floor('h')
bikedata['dtstopdatehour'] = bikedata.dtstoptime.dt.floor('h')
Display the data types of the data frame
bikedata.dtypes
tripduration int64
starttime object
stoptime object
start station id int64
start station name object
start station latitude float64
start station longitude float64
end station id int64
end station name object
end station latitude float64
end station longitude float64
bikeid int64
usertype object
birth year float64
male int64
female int64
dtstartdatehour datetime64[ns]
dtstarttime datetime64[ns]
dtstoptime datetime64[ns]
date object
dtstopdatehour datetime64[ns]
dtype: object
As the graph below shows, the busiest times in the month of March are 8-10 am and 5-6 pm. It also seems that people prefer taking bikes in the evening after work
aggByHour = bikedata.groupby(by=bikedata.dtstartdatehour.dt.hour).count()['start station id']
sns.barplot(x=aggByHour.index, y=aggByHour.values)
As you can see below, the busiest start and end stations are almost the same ones. We can deduce a few things from this.
- The bike company could employ more staff in these stations
- They could keep these stations full in the mornings and evenings
aggByStartSt = bikedata.groupby(by=['start station name']).count()['start station id'].sort_values(ascending=False).head(10)
aggByEndSt = bikedata.groupby(by=['end station name']).count()['end station id'].sort_values(ascending=False).head(10)
f, axes = plt.subplots(1,2,figsize=(12, 6))
g1 = sns.barplot(x=aggByStartSt.index, y=aggByStartSt.values, ax=axes[0])\
    .set_xticklabels(aggByStartSt.index, rotation=90)
g2 = sns.barplot(x=aggByEndSt.index, y=aggByEndSt.values, ax=axes[1])\
    .set_xticklabels(aggByEndSt.index, rotation=90)
# Merging weather and bike data
df = pd.merge(bikedata, weather, on='date', how='left')
df.dtypes
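A left join silently produces NaNs for any trip date missing from the weather table, and pandas' `indicator` flag makes such gaps easy to spot. A self-contained sketch with toy frames (the column `meantempm` here is just a placeholder, not necessarily a column in the real weather file):

```python
import pandas as pd

# Toy stand-ins for bikedata and weather
trips = pd.DataFrame({"date": ["2016-03-01", "2016-03-02", "2016-03-03"]})
weather = pd.DataFrame({"date": ["2016-03-01", "2016-03-02"],
                        "meantempm": [41, 38]})

merged = pd.merge(trips, weather, on="date", how="left", indicator=True)
# Rows tagged "left_only" are trip dates with no matching weather row
unmatched = merged[merged["_merge"] == "left_only"]
```

Checking that `unmatched` is empty after the real merge confirms the weather file covers every day of trip data.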
We can see that only 1.12% of the trips start and end at the same station
float(len(bikedata[bikedata['start station id'] == bikedata['end station id']]))*100/len(bikedata)
1.1209185591134498
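The same figure can be computed without the explicit float/len arithmetic: comparing the two columns gives a boolean Series, and its mean is the fraction of round trips. A toy example (station ids invented for illustration):

```python
import pandas as pd

# Toy frame: one of four trips starts and ends at the same station
df = pd.DataFrame({"start station id": [72, 79, 82, 83],
                   "end station id":   [72, 116, 119, 120]})

# mean() of a boolean Series is the share of True values
round_trip_pct = (df["start station id"] == df["end station id"]).mean() * 100
```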
More data manipulation
Extract the Hour component from starttime and stoptime
#startHour = bikedata.starttime.str.split().str[1].str.split(':').str[0].astype('int')
#endHour = bikedata.stoptime.str.split().str[1].str.split(':').str[0].astype('int')
startHour = bikedata.dtstarttime.dt.hour
endHour = bikedata.dtstoptime.dt.hour
# All data where starttime is before 9 am
# dfStopBeforeTen = bikedata[endHour <= 9]
# All data where starttime is after 4 pm
# dfStartAfterFour = bikedata[startHour >= 16]
Getting all data for mornings (before 10 am) and evenings (after 4 pm)
dfMorning = bikedata[endHour <= 10]
dfEvening = bikedata[startHour >= 16]
dfMorningStart = dfMorning.groupby(by=['start station id'])['bikeid'].count().sort_values(ascending=False)
dfMorningEnd = dfMorning.groupby(by=['end station id'])['bikeid'].count().sort_values(ascending=False)
dfEveningStart = dfEvening.groupby(by=['start station id'])['bikeid'].count().sort_values(ascending=False)
dfEveningEnd = dfEvening.groupby(by=['end station id'])['bikeid'].count().sort_values(ascending=False)
topN = 750
# Find the start and end stations with more than topN trips before 10 am
dfTopMorningStart = dfMorningStart[dfMorningStart > topN]
dfTopMorningEnd = dfMorningEnd[dfMorningEnd > topN]
# Find the start and end stations with more than topN trips after 4 pm
dfTopEveningStart = dfEveningStart[dfEveningStart > topN]
dfTopEveningEnd = dfEveningEnd[dfEveningEnd > topN]
netMorningSource = bikedata[bikedata['start station id'].isin(dfTopMorningStart.index.values)]
netMorningSink = bikedata[bikedata['end station id'].isin(dfTopMorningEnd.index.values)]
netEveningSource = bikedata[bikedata['start station id'].isin(dfTopEveningStart.index.values)]
netEveningSink = bikedata[bikedata['end station id'].isin(dfTopEveningEnd.index.values)]
We can see that the net sinks (red) are concentrated in the middle of Manhattan, where most of the offices are
netMorningSourcePoints = netMorningSource[['start station latitude', 'start station longitude']]
netMorningSinkPoints = netMorningSink[['end station latitude', 'end station longitude']]
netMorningSourcePointsToPlot = netMorningSourcePoints.drop_duplicates()
netMorningSinkPointsToPlot = netMorningSinkPoints.drop_duplicates()
nyMapMorning = folium.Map(location=[40.7230, -73.9712], zoom_start=12)
for index, row in netMorningSourcePointsToPlot.iterrows():
    folium.CircleMarker([row['start station latitude'], row['start station longitude']], radius=3, color='green').add_to(nyMapMorning)
for index, row in netMorningSinkPointsToPlot.iterrows():
    folium.CircleMarker([row['end station latitude'], row['end station longitude']], radius=3, color='red').add_to(nyMapMorning)
nyMapMorning
netEveningSourcePoints = netEveningSource[['start station latitude', 'start station longitude']]
netEveningSinkPoints = netEveningSink[['end station latitude', 'end station longitude']]
netEveningSourcePointsToPlot = netEveningSourcePoints.drop_duplicates()
netEveningSinkPointsToPlot = netEveningSinkPoints.drop_duplicates()
nyMapEvening = folium.Map(location=[40.7230, -73.9712], zoom_start=12)
for index, row in netEveningSourcePointsToPlot.iterrows():
    folium.CircleMarker([row['start station latitude'], row['start station longitude']], radius=3, color='green').add_to(nyMapEvening)
for index, row in netEveningSinkPointsToPlot.iterrows():
    folium.CircleMarker([row['end station latitude'], row['end station longitude']], radius=3, color='red').add_to(nyMapEvening)
nyMapEvening