Web Crawler for CMU 11601

Using the Crawler

This project has external dependencies on the following packages:
1. WordCloud
2. Image
3. Matplotlib
4. BeautifulSoup 4
5. PrettyTable
6. MySql
Prior to building the project, make sure that you have all the external packages installed by running:

sudo pip install wordcloud image matplotlib beautifulsoup4 prettytable mysql-python

To start crawler
- Type
```
python cralwer.py Keyword Mode
```
- By default, the crawler runs in DFS with keyword "java"
- User should enter both the keyword and crawling method ("dfs" or "bfs") in order to override default setting
To pause routine:
- control + C
- To resume, press any key
- To exit, type "quit"
- To print the wordCloud Imge, type "print"

Since we do not maintain an email server, emails are sent to the Gmail server (other types of email accounts are currently not supported)
In order to use the email notification feature, users need to have a Gmail account (or sign up for one)
Users need to allow "less secure apps" in their Gmail settings

This project uses mysql database
This project assumes that you have already create a database named "test" in mysql. And you should have a database user named "root", the passwprd should be "root". the database will store the data in local machine.
if you do not want to use a new database or new user and new password or you want to store the data in remote machine, you need to change the following code
[con = mdb.connect('localhost', 'root', 'root', 'test');]
as [con = mdb.connect(New_address, New_user, New_password, New_database_name);]
When user runs the crawler, the database will creates a table name keywordKEYWORD_dfs(bfs). in this table, we record the title, link of website and the insert time.
When the code quits, the database will create two other tables, one is used to store the websites in visited list named KEYWORDvisited_dfs(bfs), another is used to store the websites in unvisited list named KEYWORDunvisited_dfs(bfs), in otherwords, we will have threetables if we quit the code. otherwise only one table.
Every time when the code starts, the code will check if there are responding table in database, if yes, user can choose start from the previous point or restart from the beginning.
There is a file named "saveDate" which contains functions to put one record into database once.

According to the project handout, our crawler needs to have the following features at the very least:

Takes as input keywords that must be present on a web page
Recursively crawls links on a web page
User may specify breadth-first or depth-first crawling
Rests at least two seconds between each recursive step for pages within the same domain (no rest required for pages on other domains)
Can be terminated at any time by the user
Has no preset limit to recursion depth
Stores data in a database
Provides real-time status to the user about how many pages has been crawled, what is currently being crawled, what is planned, etc. (use some creativity here)
May be either command line or web-based application.
Define at least four additional features not listed above

User can set his email, when the routine is finished, rontine would send an email mention user this situation
User can print current web page as wordcloud image
Routine would compute the score of each web page base on the number of outlinks containing the given keyword
When the crawler finishes (or user terminated), routine would print a top 10 relevant web page according to the grade of each web page

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
image		image
BFS_crawl.py		BFS_crawl.py
DFS_crawl.py		DFS_crawl.py
README.md		README.md
check.py		check.py
connectMangoDB.py		connectMangoDB.py
crawler.py		crawler.py
crawler_utils.py		crawler_utils.py
saveData.py		saveData.py
sendemail.py		sendemail.py
settings.py		settings.py
start.py		start.py
wordcloud.png		wordcloud.png
wordcloudprint.py		wordcloudprint.py