GitHub

How to Run

Part 1: Tweet Streaming

Dependencies: tweepy

Run in command line:

python tweetstreams.py

This piece of code gets tweet streams using tweepy, and stores extended tweet json objects in file (Default= twitterDoc.txt)

Part 2: Indexing and Searching

Dependencies: pylucene, json

Related code: generateIndex.py, searchIndex.py, index.py

Run in command line:

python generateIndex.py index_folder

This python script generate index using Lucene and save indexed files to index_folder. (Default input=twitterDoc.txt)

python searchIndex.py index_folder

This piece of code search and return result of query using indexed files in index_folder.

Part 3: Popularity Measurement using Mapreduce

Dependencies: hadoop, json

Hadoop version: 2.7.7 (hadoop-streaming-2.7.7.jar)

Related code: map.py, reduce.py, ave.py

Run in command line:

hadoop jar /usr/local/src/hadoop-2.7.7/share/hadoop/tools/lib/hadoop-streaming-2.7.7.jar
-file 'map.py' -file 'reduce.py'
-mapper "python map.py" -reducer "python reduce.py"
-input 'input.txt' -output 'result'

This mapreduce job counts the total number of retweet, quote, reply and favorite of each hashtag. However, the data should also be normalized by the size of file. Normalizing:

python ave.py

In output file (default=ave.txt) is the normalized popularity of topics(sorted from high to low).

Saved Files

Folder index/ contains the indexing results.

In folder mapReduceResults, there are 4 raw results of mapreduce (Named as result_Dec*) and 1 normalized result Normalized Popularity

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
index		index
mapReduceResults		mapReduceResults
.gitattributes		.gitattributes
.gitignore		.gitignore
Final Report.pdf		Final Report.pdf
LICENSE		LICENSE
README.md		README.md
ave.py		ave.py
generateIndex.py		generateIndex.py
index.py		index.py
map.py		map.py
reduce.py		reduce.py
run.sh		run.sh
searchIndex.py		searchIndex.py
tweet.py		tweet.py
tweetstreams.py		tweetstreams.py

License

benjaminmao123/TwitterCrawler

Folders and files

Latest commit

History

Repository files navigation

How to Run

Part 1: Tweet Streaming

Part 2: Indexing and Searching

Part 3: Popularity Measurement using Mapreduce

Saved Files

About

Resources

License

Stars

Watchers

Forks

Languages