- Link the GNIP JSON file to step_0/input
- Run step_0/scripts/concat.py (a rough sketch of the script is shown below)

This will produce three (3) files:
tweets_0.json
tweets_1.json
tweets_2.json
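The actual logic lives in step_0/scripts/concat.py; as a rough, hypothetical illustration only, here is a minimal sketch that assumes the GNIP input is a set of newline-delimited JSON files linked under step_0/input/ and splits the records evenly across the three outputs:

# Hypothetical sketch of the concatenation step; see step_0/scripts/concat.py
# for the authoritative version. Assumes newline-delimited JSON input.
import glob

records = []
for path in sorted(glob.glob('step_0/input/*.json')):
    with open(path) as f:
        records.extend(line.rstrip('\n') for line in f if line.strip())

# Distribute the records round-robin across three output files.
for i in range(3):
    with open('tweets_%d.json' % i, 'w') as out:
        for record in records[i::3]:
            out.write(record + '\n')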
These files have been uploaded to S3, under the following buckets:
s3://chipotle-crisis (located in the US)
s3://chipotle-crisis-sg (located in Singapore)

Currently, all three files (frozen versions) are also available on S3 at s3://chipotle-crisis-final/step_0_results.
Initialise a Spark cluster on Amazon EMR with the following software configuration:
- emr-4.7.1
- Spark 1.6.1
- Hive 1.0.0
- Hadoop 2.7.2
- Hue 3.7.1
- Zeppelin-Sandbox 0.5.6
- Pig 0.14.0
The hardware configuration is as follows:
- Master - 1x m4.xlarge instance
- Core - 4x m4.xlarge instances
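If you prefer to launch the cluster from code rather than the EMR console, here is a hedged boto3 sketch; the cluster name, key pair, and IAM roles below are assumptions, and the region should be ap-southeast-1 for the Singapore cluster:

# Sketch of launching the EMR cluster with boto3. Substitute your own
# name, key pair, roles, and region; these values are placeholders.
import boto3

emr = boto3.client('emr', region_name='us-east-1')
response = emr.run_job_flow(
    Name='chipotle-crisis',              # assumed cluster name
    ReleaseLabel='emr-4.7.1',
    Applications=[{'Name': a} for a in
                  ['Spark', 'Hive', 'Hadoop', 'Hue', 'Zeppelin-Sandbox', 'Pig']],
    Instances={
        'InstanceGroups': [
            {'InstanceRole': 'MASTER', 'InstanceType': 'm4.xlarge',
             'InstanceCount': 1},
            {'InstanceRole': 'CORE', 'InstanceType': 'm4.xlarge',
             'InstanceCount': 4},
        ],
        'Ec2KeyName': 'my-key-pair',     # assumed key pair
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])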
After the cluster has started up, we need to copy the JSON files to HDFS for Spark to consume. Do the following (US version):
# Remember to use screen
hadoop fs -mkdir tweets
hdfs dfs -cp s3://chipotle-crisis/tweets_0.json tweets/tweet_0.json
hdfs dfs -cp s3://chipotle-crisis/tweets_1.json tweets/tweet_1.json
hdfs dfs -cp s3://chipotle-crisis/tweets_2.json tweets/tweet_2.json
hadoop fs -ls tweets
For a Spark cluster based in Singapore, use the following:
# Remember to use screen
hadoop fs -mkdir tweets
hdfs dfs -cp s3://chipotle-crisis-sg/tweets_0.json tweets/tweet_0.json
hdfs dfs -cp s3://chipotle-crisis-sg/tweets_1.json tweets/tweet_1.json
hdfs dfs -cp s3://chipotle-crisis-sg/tweets_2.json tweets/tweet_2.json
hadoop fs -ls tweets
Download the Python code from GitHub:
wget https://github.com/chuajiesheng/twitter-sentiment-analysis/archive/4483cecf8d9663a21bf3a1db7f2bb9f019ad4c4e.zip
unzip 4483cecf8d9663a21bf3a1db7f2bb9f019ad4c4e.zip
Then start up Spark and run the sampling script.
Note: I have realised that rdd.takeSample returns different samples on different machines.
# Note the hash
cd twitter-sentiment-analysis-4483cecf8d9663a21bf3a1db7f2bb9f019ad4c4e
# run Spark
pyspark
execfile('step_1/scripts/instructional_sampling.py')
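To see the takeSample caveat for yourself, you can sample a few lines interactively inside the pyspark shell; even with a fixed seed, the result can vary across clusters because it also depends on how the data happens to be partitioned:

# Run inside the pyspark shell, where `sc` is already defined.
# takeSample(withReplacement, num, seed) is the Spark 1.6 signature;
# a fixed seed alone does not guarantee identical samples across
# clusters, since the outcome depends on data partitioning too.
rdd = sc.textFile('tweets')
sample = rdd.takeSample(False, 5, seed=42)
for line in sample:
    print(line[:80])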
In the current working directory, which is twitter-sentiment-analysis-4483cecf8d9663a21bf3a1db7f2bb9f019ad4c4e, you will find the following files:
Sample tweets (for demo purposes):
sample_posts.csv
sample_posts.json
Development tweets (3,000 tweets):
dev_posts.csv
dev_posts.json
Kappa tweets (300 tweets sampled from the development tweets):
kappa_posts.csv
kappa_posts.json
Currently, all six files (frozen versions) have been uploaded to S3 at s3://chipotle-crisis-final/step_1_results.
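As a quick sanity check, you can load one of the frozen files back into the pyspark shell; the exact file name under step_1_results is assumed to match the listing above:

# Inside the pyspark shell; `sqlContext` is provided in Spark 1.6.
df = sqlContext.read.json('s3://chipotle-crisis-final/step_1_results/dev_posts.json')
df.printSchema()
print(df.count())  # expect 3,000 development tweets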
Handy commands for moving files between S3, the local filesystem, and HDFS:

# Download a file from S3 to the local filesystem
aws s3 cp s3://<filepath> .
# Upload a local file to HDFS
hadoop fs -put <filepath> .
# Print a (possibly compressed) HDFS file as text
hdfs dfs -text <filepath>
# Decompress a gzipped HDFS file and write the result back to HDFS
hadoop fs -cat <filepath> | gzip -d | hadoop fs -put - <output>