The very first step of setting up a project is to create a local work repository, or 'repo' for short. You can do this by simply typing `mkdir myLocalRepo` in the command line. In case your local machine fails, you can back up all your work in a remote repository hosted on a third-party site like GitHub.com. Remember that on GitHub's free plan repos are public, so don't push anything proprietary.
Creating a remote repo on GitHub is easy; follow the instructions here. Once the remote repo is set up, you can simply run `git remote add origin https://github.com/robert8138/imdb_project.git` in your local work repo; this links your local working directory to the remote repository. Anything that you add or modify in the local directory can then be pushed to the remote repo with `git push`.
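The setup steps above can be sketched end to end as a short shell session (repo name and remote URL are taken from the post; the push itself needs the remote repo to exist and credentials to be configured, so it is shown commented out):

```shell
# Create the local repo and turn it into a git repository
mkdir myLocalRepo && cd myLocalRepo
git init

# Link the local repo to the remote one on GitHub
git remote add origin https://github.com/robert8138/imdb_project.git
git remote -v                 # confirm the remote is registered

# After committing work locally, push it to GitHub:
# git push -u origin master   # requires the remote repo and credentials
```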
In my particular case, I knew I was going to write a scraper in Python, so when setting up my local work directory I didn't just use `mkdir`. Instead, I created a virtualenv directory by typing `virtualenv imdb_project`. The advantage of virtualenv is that pip is installed by default, so you can install all the Python packages you need with `pip install`. Furthermore, it creates an isolated environment where all the package management business is taken care of, so you don't need to worry about version conflicts with other projects. To activate the environment, type `source imdb_project/bin/activate`. Now you are free to `pip install` your favorite packages.
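A minimal sketch of that environment setup (this uses the stdlib `venv` module, which behaves like the `virtualenv` tool in the post; the `pip install` line is commented out since it needs network access):

```shell
# Create an isolated environment (stdlib equivalent of `virtualenv imdb_project`)
python3 -m venv imdb_project

# Activate it; `python` and `pip` now point inside the environment
source imdb_project/bin/activate
python -c "import sys; print(sys.prefix)"   # prints a path inside imdb_project

# pip install requests    # would install into imdb_project only

deactivate                # leave the environment
```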
To get started, I modeled my work after the scraping exercise in lab 4 of Harvard CS 109's data science class. The very first thing I did before writing a single line of Python code was `pip install requests pattern BeautifulSoup`.
I basically copied exactly what the lab did, and the result is in scraper.py. The high-level idea is simple:

- Use `requests` to send an HTTP GET request; in return, we receive an HTML document
- Use `pattern.web` to parse the HTML into a DOM
- Once we have the DOM, traverse it to find the information we want
- Search for specific DOM elements with `.by_tag`
- If interested in an attribute value, simply use `element.attributeName` to see what's in there
- Use `element.HTML` or `element.content` to see the values
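The flow above can be sketched with BeautifulSoup (installed earlier) standing in for `pattern.web`; a hardcoded HTML snippet replaces the live IMDb response so the example runs offline, and the tag/class names are made up for illustration:

```python
# Sketch of the scraping flow: parse HTML into a DOM, traverse it,
# and pull out attribute values and text content.
from bs4 import BeautifulSoup

# Stand-in for the HTML returned by a requests.get() call
html = """
<table>
  <tr><td class="title"><a href="/title/tt0111161/">The Shawshank Redemption</a></td></tr>
  <tr><td class="title"><a href="/title/tt0068646/">The Godfather</a></td></tr>
</table>
"""

dom = BeautifulSoup(html, "html.parser")

records = []
for cell in dom.find_all("td", class_="title"):      # analogous to pattern's .by_tag
    link = cell.find("a")
    records.append((link["href"], link.get_text()))  # attribute value and content

print(records)
```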
I got the scraper working for the first page, which contains only 50 records. In order to store all the data, I will need to loop through the pages by changing the `start` parameter in the URL.
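That paging loop can be sketched by generating one URL per batch of 50 records; the exact query-string shape here is an assumption for illustration, not taken from IMDb's documentation:

```python
# Build one URL per page of results by varying the `start` parameter.
# BASE is a hypothetical search URL; substitute the real one.
BASE = "http://www.imdb.com/search/title?sort=num_votes,desc&start={start}"

def page_urls(total_records, page_size=50):
    """Return the list of URLs needed to cover `total_records` results,
    stepping `start` by `page_size` each time (1, 51, 101, ...)."""
    return [BASE.format(start=s) for s in range(1, total_records + 1, page_size)]

urls = page_urls(200)
print(len(urls))   # -> 4
print(urls[0])
```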
A couple of lessons I learned along the way:
- How to print Python STDOUT in different colors! Very handy for debugging
- In web scraping, the Chrome developer tools are super useful. Pro tip: use the "magnifying glass" in the top left and hover over a page element; it will highlight the corresponding element in the HTML source directly!
- When you need to replace multiple characters in a string, use the `re` library; in particular, `re.sub(pattern, replacement, yourString)`, e.g. `re.sub(r'[!^&%]', '', yourString)`
- How to write to csv in Python
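The colored-output trick can be sketched with raw ANSI escape codes (this assumes an ANSI-capable terminal; libraries like termcolor wrap the same codes):

```python
# Color debug output by wrapping text in ANSI escape codes.
RED, GREEN, RESET = "\033[91m", "\033[92m", "\033[0m"

def colored(text, code):
    """Wrap `text` in an ANSI color code, resetting the color afterwards."""
    return f"{code}{text}{RESET}"

print(colored("parse error on row 12", RED))
print(colored("50 records scraped", GREEN))
```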
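The multi-character replacement can be sketched with a regex character class, so a single `re.sub` call handles all the unwanted characters at once (the input string is illustrative):

```python
import re

# Strip several unwanted characters in one pass using a character class.
raw = "50%!^ Great&Movie"
cleaned = re.sub(r"[!^&%]", "", raw)
print(cleaned)  # -> "50 GreatMovie"
```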
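Writing scraped rows out with the stdlib `csv` module can be sketched as follows (the filename and rows are illustrative, not real scraped data):

```python
import csv

# Illustrative rows standing in for scraped records.
rows = [("The Shawshank Redemption", 1994), ("The Godfather", 1972)]

# newline="" prevents extra blank lines on some platforms.
with open("movies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "year"])  # header row
    writer.writerows(rows)              # one line per record
```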
- Web scraping 101 with Python
- More web scraping with Python
- Y combinator's response and comparison on scraping packages
- Libraries:
- [Twitter Bootstrap Examples]
- [Getting Started]
- [Youtube Videos]
- [How to edit an incorrect commit message in Git]
[Fix virtualenv]: http://mikeboers.com/blog/2014/12/05/repairing-python-virtual-environments
[Twitter Bootstrap Examples]: http://www.tutorialrepublic.com/twitter-bootstrap-examples.php
[Getting Started]: http://getbootstrap.com/2.3.2/getting-started.html
[Youtube Videos]: https://www.youtube.com/watch?v=beUUBc-ueAM&list=PLKlA1QwYBcmcEUUBSmkl8_kgwn-_zuy-W&index=10