NOTE: Project revised according to the feedback received in the first round of review.
This repository contains my first project for Udacity's Data Science Nanodegree (Module 1): https://www.udacity.com/course/data-analyst-nanodegree--nd002. Specifically, it is the project for the module "Introduction to data science", titled "Analyzing the New York Subway Dataset".
The answers to the 'short questions' can be found in the docs folder (https://github.com/jtorrente/nyc-data-analysis/tree/master/docs), along with an example of the output the program generates and high-resolution versions of the visualizations created. Direct link to the short questions PDF file: https://github.com/jtorrente/nyc-data-analysis/blob/master/docs/Answers.pdf
The source code of the project can be found in the folder https://github.com/jtorrente/nyc-data-analysis/tree/master/nycsubway/module1. The main file is called 'project1.py'. Direct link to this file: https://github.com/jtorrente/nyc-data-analysis/blob/master/nycsubway/module1/project1.py. This repository also contains the code used to complete problem sets 1-4 of the course. That code contains many contributions from Udacity, so I cannot be considered its sole author. It is located in the folder https://github.com/jtorrente/nyc-data-analysis/tree/master/nycsubway/module1/problemsets.
The references used for this project are described in the 'short questions' file. The most relevant source of information and content I have used is the Udacity course materials. Most of the code needed to complete this project was provided by Udacity to help students complete the problems and exercises of the Intro to Data Science course. On top of that code base, I have written new code and improved the existing code to complete the project.
Data files included in the data folder (https://github.com/jtorrente/nyc-data-analysis/tree/master/data) have been downloaded directly from the downloads section of the Udacity course.
Apart from Udacity’s materials, I have used additional sources to get deeper insight into the Mann-Whitney U test, especially how effect sizes should be reported for this test. It is often argued that, when reporting statistical analyses, inference tests should be accompanied not only by the value of the statistic used (e.g. ‘t’ or ‘U’) and the p-value (the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true), but also by an estimate of the effect size. This has several benefits. First, it allows for discussing how strong the relationship found between the dependent and independent variables is. Second, it facilitates meta-analysis of research results on a particular topic.
In this regard, I have used the rank-biserial coefficient as an estimator of effect size. I have used the following three references on this topic:
http://yatani.jp/teaching/doku.php?id=hcistats:mannwhitney
https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Rank-biserial_correlation
Wendt, H. W. (1972). Dealing with a common problem in Social science: A simplified rank-biserial coefficient of correlation based on the U statistic. European Journal of Social Psychology, 2(4), 463-465. http://doi.org/10.1002/ejsp.2420020412
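As an illustration of the effect size discussed above, the following is a minimal sketch of the simplified rank-biserial coefficient, r = 1 - 2U / (n1 * n2), where U is the smaller of the two Mann-Whitney U values. It is not the code in 'project1.py'; the function names and the sample values are hypothetical and chosen only for the example. The pairwise count is quadratic in the sample sizes, so it is only meant for small inputs.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y), counting pairwise wins; ties count as half a win."""
    if len(x) == 0 or len(y) == 0:
        raise ValueError("both samples must be non-empty")
    u_x = 0.0
    for a in x:
        for b in y:
            if a > b:
                u_x += 1.0
            elif a == b:
                u_x += 0.5
    # The two U values always add up to the number of pairs.
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


def rank_biserial_from_u(u, n1, n2):
    """Simplified rank-biserial coefficient from a U value and sample sizes."""
    # Use the smaller U so the result does not depend on which of the
    # two U values the caller happens to pass in.
    u_small = min(u, n1 * n2 - u)
    return 1.0 - 2.0 * u_small / (n1 * n2)


def rank_biserial(x, y):
    """Rank-biserial coefficient computed directly from two samples."""
    u_x, _ = mann_whitney_u(x, y)
    return rank_biserial_from_u(u_x, len(x), len(y))


# Hypothetical sample values, for illustration only.
group_a = [3, 5, 8]
group_b = [1, 2, 4]
effect_size = rank_biserial(group_a, group_b)
```

A value near 0 indicates heavily overlapping groups, while a value near 1 indicates that almost every observation in one group exceeds those in the other. For large samples, a U value obtained from a statistics library could be passed to rank_biserial_from_u instead of counting pairs.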
I have also used the online documentation for ggplot, numpy and pandas to answer questions and fix problems that came up along the way. I have also consulted some threads on Stack Overflow, but no code was copied from any of these sources:
http://stackoverflow.com/questions/22543776/python-ggplot-issues-plotting-8-stocks-and-legend-is-cutoff
http://stackoverflow.com/questions/3606697/how-to-set-limits-for-axes-in-ggplot2-r-plots