The very first step of setting up a project is to create a local work repository, or 'repo' for short. You can do this by simply typing `mkdir myLocalRepo` in the command line. In case your local machine fails, you can back up all your work in a remote repository hosted on a third-party site like GitHub.com. Remember that on GitHub's free plan repos are public, so don't push anything proprietary.
Creating a remote repo on GitHub is easy; follow the instructions here. Once the remote repo is set up, you can simply run `git remote add origin https://github.com/robert8138/imdb_project.git` in your local work repo; this links your local working directory to the remote repository. Anything that you add or modify in the local directory can then be pushed to the remote repo with `git push`.
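The setup steps above can be sketched end to end as a short shell session (repo name and remote URL are taken from the post; the push itself needs the remote repo to exist and credentials to be configured, so it is shown commented out):

```shell
# Create the local repo and turn it into a git repository
mkdir myLocalRepo && cd myLocalRepo
git init

# Link the local repo to the remote one on GitHub
git remote add origin https://github.com/robert8138/imdb_project.git
git remote -v                 # confirm the remote is registered

# After committing work locally, push it to GitHub:
# git push -u origin master   # requires the remote repo and credentials
```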
In my particular case, I knew I was going to write a scraper in Python, so when setting up my local work directory I didn't just use `mkdir`. Instead, I created a virtualenv directory by typing `virtualenv imdb_project`. The advantage of virtualenv is that pip is installed by default, so you can install all the Python packages you need with `pip install`. Furthermore, it creates an isolated environment where all the package management business is taken care of, so you don't need to worry about version conflicts with other projects. To activate the environment, type `source imdb_project/bin/activate`. Now you are free to `pip install` your favorite packages.
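A minimal sketch of that environment setup (this uses the stdlib `venv` module, which behaves like the `virtualenv` tool in the post; the `pip install` line is commented out since it needs network access):

```shell
# Create an isolated environment (stdlib equivalent of `virtualenv imdb_project`)
python3 -m venv imdb_project

# Activate it; `python` and `pip` now point inside the environment
source imdb_project/bin/activate
python -c "import sys; print(sys.prefix)"   # prints a path inside imdb_project

# pip install requests    # would install into imdb_project only

deactivate                # leave the environment
```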
To get started, I modeled my work after the scraping exercise in lab 4 of Harvard CS 109's data science class. The very first thing I did before writing a single line of Python code was `pip install requests pattern BeautifulSoup`.
I basically copied exactly what the lab did, and the result is in scraper.py. The high-level idea is simple:

- Use `requests` to send an HTTP GET request; in return, we receive an HTML document
- Use `pattern.web` to parse the HTML into a DOM
- Once we have the DOM, traverse it to find the information we want
- Search for specific DOM elements with `.by_tag`
- If interested in an attribute value, simply use `element.attributeName` to see what's in there
- Use `element.HTML` or `element.content` to see the values
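The flow above can be sketched with BeautifulSoup (installed earlier) standing in for `pattern.web`; a hardcoded HTML snippet replaces the live IMDb response so the example runs offline, and the tag/class names are made up for illustration:

```python
# Sketch of the scraping flow: parse HTML into a DOM, traverse it,
# and pull out attribute values and text content.
from bs4 import BeautifulSoup

# Stand-in for the HTML returned by a requests.get() call
html = """
<table>
  <tr><td class="title"><a href="/title/tt0111161/">The Shawshank Redemption</a></td></tr>
  <tr><td class="title"><a href="/title/tt0068646/">The Godfather</a></td></tr>
</table>
"""

dom = BeautifulSoup(html, "html.parser")

records = []
for cell in dom.find_all("td", class_="title"):      # analogous to pattern's .by_tag
    link = cell.find("a")
    records.append((link["href"], link.get_text()))  # attribute value and content

print(records)
```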
I got the scraper working for the first page, which contains only 50 records. In order to store all the data, I will need to loop through the pages by changing the `start` parameter in the URL.
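That paging loop can be sketched by generating one URL per batch of 50 records; the exact query-string shape here is an assumption for illustration, not taken from IMDb's documentation:

```python
# Build one URL per page of results by varying the `start` parameter.
# BASE is a hypothetical search URL; substitute the real one.
BASE = "http://www.imdb.com/search/title?sort=num_votes,desc&start={start}"

def page_urls(total_records, page_size=50):
    """Return the list of URLs needed to cover `total_records` results,
    stepping `start` by `page_size` each time (1, 51, 101, ...)."""
    return [BASE.format(start=s) for s in range(1, total_records + 1, page_size)]

urls = page_urls(200)
print(len(urls))   # -> 4
print(urls[0])
```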
A couple of lessons I learned along the way:
- How to print Python STDOUT in different colors! Very handy for debugging
- In web scraping, the Chrome developer tools are super useful. Pro tip: use the "magnifying glass" in the top left and hover over a page element; it will highlight the corresponding element in the HTML source directly!
- When you need to replace multiple characters in a string, use the `re` library; in particular, `re.sub(pattern, replacement, yourString)`, e.g. `re.sub(r'[!^&%]', '', yourString)`
- How to write to csv in Python
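The colored-output trick can be sketched with raw ANSI escape codes (this assumes an ANSI-capable terminal; libraries like termcolor wrap the same codes):

```python
# Color debug output by wrapping text in ANSI escape codes.
RED, GREEN, RESET = "\033[91m", "\033[92m", "\033[0m"

def colored(text, code):
    """Wrap `text` in an ANSI color code, resetting the color afterwards."""
    return f"{code}{text}{RESET}"

print(colored("parse error on row 12", RED))
print(colored("50 records scraped", GREEN))
```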
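The multi-character replacement can be sketched with a regex character class, so a single `re.sub` call handles all the unwanted characters at once (the input string is illustrative):

```python
import re

# Strip several unwanted characters in one pass using a character class.
raw = "50%!^ Great&Movie"
cleaned = re.sub(r"[!^&%]", "", raw)
print(cleaned)  # -> "50 GreatMovie"
```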
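Writing scraped rows out with the stdlib `csv` module can be sketched as follows (the filename and rows are illustrative, not real scraped data):

```python
import csv

# Illustrative rows standing in for scraped records.
rows = [("The Shawshank Redemption", 1994), ("The Godfather", 1972)]

# newline="" prevents extra blank lines on some platforms.
with open("movies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "year"])  # header row
    writer.writerows(rows)              # one line per record
```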
- Web scraping 101 with Python
- More web scraping with Python
- Y combinator's response and comparison on scraping packages
- Libraries:
- [Twitter Bootstrap Examples]
- [Getting Started]
- [Youtube Videos]
- [How to edit an incorrect commit message in Git]
[Fix virtualenv]: http://mikeboers.com/blog/2014/12/05/repairing-python-virtual-environments
[Twitter Bootstrap Examples]: http://www.tutorialrepublic.com/twitter-bootstrap-examples.php
[Getting Started]: http://getbootstrap.com/2.3.2/getting-started.html
[Youtube Videos]: https://www.youtube.com/watch?v=beUUBc-ueAM&list=PLKlA1QwYBcmcEUUBSmkl8_kgwn-_zuy-W&index=10