ajunanda/imdb-python-d3-dplr-project

Create a local work repo & remote repo on GitHub

The very first step of setting up a project is to create a local work repository, or 'repo' for short. You can do this by simply typing mkdir myLocalRepo in the command line. In case your local machine fails, you can back up all your work in a remote repository hosted on a third-party site like Github.com. Remember that on GitHub's free plan your repos will be public, so don't put proprietary work there.
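
For example, a minimal shell sketch using the directory name from above (the git init step is an assumption; the text only shows mkdir):

```
mkdir myLocalRepo
cd myLocalRepo
git init    # make the new directory a git repo so commits can be tracked
```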

Creating a remote repo on GitHub is easy; follow the instructions here. Once the remote repo is set up, you can simply run git remote add origin https://github.com/robert8138/imdb_project.git in your local work repo; this effectively links your local working directory to the remote one. Anything that you add or modify in the local directory can then be pushed to the remote repo with git push.
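
A sketch of the full flow, assuming the remote URL from above and the default branch name of the era (master):

```
git remote add origin https://github.com/robert8138/imdb_project.git
git add .                          # stage local changes
git commit -m "Initial commit"     # record them locally
git push -u origin master         # push and remember the upstream branch
```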

Setting up virtualenv

In my particular case, I knew I was going to write a scraper in Python, so when setting up my local work directory I didn't just use mkdir. Instead, I created a virtualenv directory by typing virtualenv imdb_project. The advantage of virtualenv is that pip is installed by default, so you can install all your Python packages using pip install. Furthermore, it creates a virtual environment where all the package management is taken care of, so you don't need to worry about namespace collisions.

To activate the virtualenv environment, type source bin/activate from inside the directory. Now you are free to pip install your favorite packages.
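
Putting the two steps together, a minimal shell sketch:

```
virtualenv imdb_project    # creates the directory with its own python and pip
cd imdb_project
source bin/activate        # prompt now shows an (imdb_project) prefix
```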

Writing a Python scraper

To get started, I modeled my approach on the scraping exercise in Lab 4 of Harvard's CS 109 data science class. The very first thing I did before writing a single line of Python code was to pip install the packages requests, pattern, and BeautifulSoup.
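
Inside the activated virtualenv, that amounts to (package names as given above; note that newer releases of BeautifulSoup are published on PyPI as beautifulsoup4):

```
pip install requests
pip install pattern
pip install BeautifulSoup
```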

I basically copied exactly what the lab did, and the result is in scraper.py. The high-level idea is simple (a short sketch follows the list):

  • Use requests to send an HTTP GET request; in return, you receive an HTML document
  • Then use pattern.web to parse the HTML into a DOM
  • Once you have the DOM, you can traverse it to find the information you want
    • You can search for specific DOM elements by .by_tag
    • If you're interested in an attribute value, simply look it up in element.attributes to see what's in there
    • Use element.source or element.content to see the values
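
A minimal sketch of that flow, assuming an IMDb advanced-search URL in the spirit of the lab; the exact URL and the td.title markup are assumptions about IMDb's page structure at the time:

```python
import requests
from pattern.web import DOM

url = 'http://www.imdb.com/search/title?sort=num_votes,desc&title_type=feature&start=1'
response = requests.get(url)            # HTTP GET; the HTML comes back in response.text
dom = DOM(response.text)                # parse the HTML into a DOM

for td in dom.by_tag('td.title'):       # every <td class="title"> result cell
    link = td.by_tag('a')[0]            # the first anchor inside the cell
    print(link.content)                 # inner HTML: the movie title
    print(link.attributes.get('href'))  # an attribute value: the link target
```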

I got the scraper working for the first page, which is only 50 records. In order to store all the data, I will need to loop through the pages by changing the start parameter in the URL.
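
A sketch of that loop (the cutoff of 250 records is an arbitrary example, not from the text):

```python
base_url = 'http://www.imdb.com/search/title?sort=num_votes,desc&title_type=feature&start=%d'
for start in range(1, 251, 50):    # IMDb pages hold 50 records each: start=1, 51, 101, ...
    page_url = base_url % start
    # fetch and parse page_url exactly as in the sketch above
```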

A few lessons I learned along the way:

  • How to print Python STDOUT in different colors. Very handy for debugging!
  • When web scraping, the Google Chrome developer tools are super useful. Pro tip: use the "magnifying glass" on the top left and hover over a DOM element; it will show you that element in the HTML code directly!
  • When you need to replace multiple characters in a string, use the re library, in particular re.sub with a character class, e.g. re.sub('[!^&%]', '', yourString)
  • How to write to CSV in Python (a short sketch combining this with re.sub follows this list)
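
A minimal sketch of those last two points, on hypothetical data (the '[!^&%]' character class is taken from the item above):

```python
import csv
import re

rows = [('The Godfather!', '9.2'), ('Pulp Fiction%', '8.9')]  # hypothetical scraped rows

with open('movies.csv', 'wb') as f:          # 'wb' is what the Python 2 csv module expects
    writer = csv.writer(f)
    writer.writerow(['title', 'rating'])     # header row
    for title, rating in rows:
        clean = re.sub('[!^&%]', '', title)  # strip all four characters in one pass
        writer.writerow([clean, rating])
```
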
More resources for scraping using Python
Twitter Bootstrap
  • [Twitter Bootstrap Examples]
  • [Getting Started]
  • [Youtube Videos]
MISC
  • [How to edit an incorrect commit message in Git]

[Fix virtualenv]: http://mikeboers.com/blog/2014/12/05/repairing-python-virtual-environments
[Twitter Bootstrap Examples]: http://www.tutorialrepublic.com/twitter-bootstrap-examples.php
[Getting Started]: http://getbootstrap.com/2.3.2/getting-started.html
[Youtube Videos]: https://www.youtube.com/watch?v=beUUBc-ueAM&list=PLKlA1QwYBcmcEUUBSmkl8_kgwn-_zuy-W&index=10

About

A project to practice different technology stacks such as Python, R, and d3.js, using a dataset from IMDb.
