jtorrente/nyc-data-analysis

NOTE: This project has been revised according to the feedback received in the first round of review.

This repository contains my first project for Udacity's Data Science Nanodegree (Module 1): https://www.udacity.com/course/data-analyst-nanodegree--nd002. It is the project for the module "Introduction to data science", titled "Analyzing the New York Subway Dataset".

The answers to the 'short questions' can be found in the docs folder (https://github.com/jtorrente/nyc-data-analysis/tree/master/docs), along with an example of the output the program generates and high-resolution versions of the visualizations created. Direct link to the short questions PDF file: https://github.com/jtorrente/nyc-data-analysis/blob/master/docs/Answers.pdf

The source code of the project can be found in the folder https://github.com/jtorrente/nyc-data-analysis/tree/master/nycsubway/module1. The main file is called 'project1.py'. Direct link to this file: https://github.com/jtorrente/nyc-data-analysis/blob/master/nycsubway/module1/project1.py

This repository also contains the code used to complete problem sets 1-4 of the course. That code includes many contributions from Udacity, so I cannot be considered its sole author. It is located in the folder https://github.com/jtorrente/nyc-data-analysis/tree/master/nycsubway/module1/problemsets.

The references used for this project are described in the 'short questions' file. The most relevant source of information and content is the Udacity course materials. Most of the code needed to complete this project was provided by Udacity to help students complete the problems and exercises of the Intro to Data Science course. On top of that code base, I have written new code and improved the existing code to complete the project.

The data files included in the data folder (https://github.com/jtorrente/nyc-data-analysis/tree/master/data) were downloaded directly from the downloads section of the Udacity course.

Apart from Udacity’s materials, I have used additional sources to get deeper insight into the Mann-Whitney U test, especially how effect sizes should be reported for this test. It is often argued that, when reporting statistical analyses, inference tests should be accompanied not only by the value of the statistic used (e.g. ‘t’ or ‘U’) and the p-value (the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true), but also by an estimator of the effect size. This has two benefits. First, it allows for discussing how strong the relationship found between the dependent and independent variables is. Second, it facilitates meta-analysis of research results on a particular topic.

In this regard, I have used the rank-biserial coefficient as an estimator of effect size. I have used the following three references on this topic:

http://yatani.jp/teaching/doku.php?id=hcistats:mannwhitney

https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Rank-biserial_correlation

Wendt, H. W. (1972). Dealing with a common problem in social science: A simplified rank-biserial coefficient of correlation based on the U statistic. European Journal of Social Psychology, 2(4), 463-465. http://doi.org/10.1002/ejsp.2420020412
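To illustrate the idea, here is a minimal sketch of how a rank-biserial coefficient can be derived from the U statistic. It is not taken from 'project1.py', which may compute it differently. The function names `mann_whitney_u` and `rank_biserial` and the sample values are hypothetical, chosen only for this example. The sketch uses the signed form r = 2U/(n1*n2) - 1, whose absolute value matches Wendt's r = 1 - 2U/(n1*n2) when U is the smaller of the two U values.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x_i, y_j) pairs with x_i > y_j.

    Tied pairs count as 0.5. This brute-force version is O(len(x) * len(y))
    and is meant for illustration, not for large datasets.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def rank_biserial(x, y):
    """Signed rank-biserial coefficient in [-1, 1].

    Positive values mean x tends to be larger than y, negative values the
    opposite, and 0 means neither sample tends to dominate.
    """
    n_pairs = len(x) * len(y)
    return 2.0 * mann_whitney_u(x, y) / n_pairs - 1.0


if __name__ == "__main__":
    # Made-up values, NOT taken from the subway dataset.
    group_a = [5, 7, 9]
    group_b = [1, 2, 8]
    print(f"U = {mann_whitney_u(group_a, group_b)}")
    print(f"r = {rank_biserial(group_a, group_b):.3f}")
```

With these made-up samples, 7 of the 9 possible pairs favor the first group, so U = 7 and r = 14/9 - 1, which is about 0.556. For real data, a library routine such as SciPy's `scipy.stats.mannwhitneyu` would be a more practical way to obtain U.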

I have also used the online documentation of ggplot, NumPy and pandas to answer questions and fix problems that came up along the way. I have also consulted some threads on Stack Overflow, but no code was copied from any of these sources:

http://stackoverflow.com/questions/22543776/python-ggplot-issues-plotting-8-stocks-and-legend-is-cutoff

http://stackoverflow.com/questions/3606697/how-to-set-limits-for-axes-in-ggplot2-r-plots
