Build a database of cleaned articles from a handful of RSS feeds published by the biggest news sources.
Here's what you can do to get started:
- Clone our GitHub repository, look at what we are working on, and talk to us to see how you could help.
- Follow the installation guide.
- Check out our issues on GitHub and get started.
If you're on OSX or Linux, you already have Python 2. Run python -V in a terminal and make sure that you're on at least 2.7.
If you're on Windows, download it from python.org. Make sure to pick a 2 version and not a 3 version.
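The version requirement above can also be checked from Python itself. This is a minimal sketch; the helper name is_supported_python is made up for illustration and is not part of the project.

```python
import sys


def is_supported_python(version_info=None):
    """Return True for a Python 2 interpreter that is at least 2.7."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    return major == 2 and minor >= 7


if __name__ == "__main__":
    if is_supported_python():
        print("Python version OK")
    else:
        # sys.version_info[:2] is a (major, minor) tuple
        print("Unsupported Python %d.%d; this project expects 2.7" % sys.version_info[:2])
```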
If you're on OSX or Linux, you already have pip installed.
If you're on Windows, you might already have pip (or not). Go to C:\python27\Scripts and see if pip.exe is there. If it is, you already have pip.
If you don't, download and install it.
You should also add pip to your path. For Windows, add ;C:\python27\Scripts to the end of it.
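If you are not sure whether a tool ended up on your path, a small lookup like the one below can tell you. It is only a sketch: find_on_path is a hypothetical helper written for this guide, and it simply walks the directories listed in the PATH environment variable.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of the first executable called `name` on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows, executables carry an extension such as .exe
    extensions = [""]
    if os.name == "nt":
        extensions += os.environ.get("PATHEXT", ".EXE").lower().split(os.pathsep)
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for ext in extensions:
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    return None


if __name__ == "__main__":
    print("pip found at: %s" % find_on_path("pip"))
```

The same helper works for any other command-line tool you need on your path.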
Out of the box, computers generally don't ship with Mongo. Unless you know you've already installed it, you probably don't have it. We currently use version 3.2; version 3.0 will not work, and we cannot vouch for 3.1 either.
Windows and OSX: Download and install it
Linux: Download and install a binary from the MongoDB download page, or set up your package manager to install it. If you don't know which option to use, try setting it up with your package manager.
You also need to create the directory /data/db. This is where Mongo stores its data.
- OSX and Linux: sudo mkdir -p -m 777 /data/db
- Windows: Create the folder C:\data and then create the folder C:\data\db.
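The directory step can be scripted as well. The sketch below uses a hypothetical helper, ensure_data_dir, which creates the directory if it is missing and reports whether the current user can write to it. It does not replicate the 777 mode from the command above, and on OSX and Linux it still needs sufficient permissions to create /data/db.

```python
import os


def ensure_data_dir(path="/data/db"):
    """Create `path` (and any missing parents) and return True if it is writable."""
    if not os.path.isdir(path):
        # Raises OSError when the current user lacks permission to create it
        os.makedirs(path)
    return os.access(path, os.W_OK)
```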
If you're on Windows, you'll also need to add Mongo to your path. It's probably located at C:\mongodb\bin\, C:\Program Files\mongodb\bin, or C:\Program Files\MongoDB\Server\x.x\bin (x.x denotes the version number; please don't literally type x.x).
Open up a terminal in this project's root directory (the folder this file is in). Then type:
pip install -r requirements.txt
If that complains about directories not being writable (probably on OSX and Linux), type sudo pip install -r requirements.txt instead.
If all of the setup stuff worked, just type mongod in a terminal window and you're good to go. You'll need to leave that window open in the background while you do stuff with the crawler. If mongod exits with an error, the process failed to start. You should talk to someone to get that fixed.
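To confirm that mongod is actually accepting connections, you can probe its TCP port. This sketch assumes mongod is running with its default port, 27017, on your own machine; mongod_is_listening is a hypothetical helper, and it only checks that something answers on the port, not that the thing answering is Mongo.

```python
import socket


def mongod_is_listening(host="localhost", port=27017, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    conn.close()
    return True


if __name__ == "__main__":
    if mongod_is_listening():
        print("mongod appears to be running")
    else:
        print("nothing is listening on port 27017; is mongod running?")
```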
The list of feeds to crawl now lives in the production database. To get it, open up mongo (make sure mongod is still running in a separate window).
mongo
Then switch to the big_data database. This is where we store all of our data. Mongo automatically makes databases that don't exist, so there's no special procedure for making a new database.
use big_data
If you don't know what state your local database is in, then you need to drop the feed and test_sources collections.
db.feed.drop()
db.test_sources.drop()
Finally, copy the feed, test_sources, and source_cleaning collections from the production database to your local database.
db.cloneCollection('db.retinanews.net', 'feed')
db.cloneCollection('db.retinanews.net', 'test_sources')
db.cloneCollection('db.retinanews.net', 'source_cleaning')
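The drop-and-copy steps above could also be driven from Python. The following is a sketch under assumptions, not the project's own tooling: copy_collection is a hypothetical helper, and it expects database handles that behave like pymongo's (find, drop, insert_many). Whether pymongo is listed in requirements.txt is not confirmed here.

```python
def copy_collection(source_db, target_db, name):
    """Replace target_db[name] with the documents in source_db[name].

    Returns the number of documents copied.
    """
    documents = list(source_db[name].find())
    target_db[name].drop()  # start from a clean local collection
    if documents:  # pymongo's insert_many rejects an empty list
        target_db[name].insert_many(documents)
    return len(documents)


# Usage sketch (assumes pymongo is installed and the production host is reachable):
#   from pymongo import MongoClient
#   source = MongoClient("db.retinanews.net")["big_data"]
#   target = MongoClient("localhost", 27017)["big_data"]
#   for name in ("feed", "test_sources", "source_cleaning"):
#       copy_collection(source, target, name)
```

Unlike the shell commands, this version drops each local collection before copying, so it can be rerun safely.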
Then exit the mongo shell and run the crawler:
python crawler.py
The first time you run it (or the first time you run it in a while), it will take a few minutes to finish.
To check if everything's working, open up a terminal and type the following:
- mongo: That will run an interactive shell that connects to your local MongoDB.
- use big_data: That connects you to the database where we store all of the crawled articles.
- db.qdoc.findOne(): That will spit out any article in the qdoc collection (table).
If the last command prints out a lot of text and it looks like a news article, congrats! Otherwise, talk to someone in the Contact section for help.
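The same spot check can be done from a script. This is a rough sketch that assumes a pymongo-style database handle; preview_document is a hypothetical helper, and it makes no assumptions about which fields an article has. It just shortens whatever fields it finds so you can eyeball them.

```python
def preview_document(db, collection="qdoc", width=60):
    """Return {field: shortened repr of value} for one document, or None if empty."""
    doc = db[collection].find_one()
    if doc is None:
        return None
    preview = {}
    for key, value in doc.items():
        text = repr(value)
        preview[key] = text if len(text) <= width else text[:width] + "..."
    return preview


# Usage sketch (assumes pymongo is installed and mongod is running locally):
#   from pymongo import MongoClient
#   print(preview_document(MongoClient("localhost", 27017)["big_data"]))
```

A result of None means the qdoc collection is empty, which suggests the crawler has not stored anything yet.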
If you have any questions, feel free to talk to Sam (smarder3@gatech.edu) or Matt (mersted@gatech.edu).
Also, Philippe (plaban3@gatech.edu) originally created the crawler. You should ask him about any weird things you see in the code.