- Link the GNIP JSON file to step_0/input
- Run step_0/scripts/concat.py (a rough sketch of the script is shown below)

This will produce three (3) files:
tweets_0.json
tweets_1.json
tweets_2.json
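The actual logic lives in step_0/scripts/concat.py; as a rough, hypothetical illustration only, here is a minimal sketch that assumes the GNIP input is a set of newline-delimited JSON files linked under step_0/input/ and splits the records evenly across the three outputs:

# Hypothetical sketch of the concatenation step; see step_0/scripts/concat.py
# for the authoritative version. Assumes newline-delimited JSON input.
import glob

records = []
for path in sorted(glob.glob('step_0/input/*.json')):
    with open(path) as f:
        records.extend(line.rstrip('\n') for line in f if line.strip())

# Distribute the records round-robin across three output files.
for i in range(3):
    with open('tweets_%d.json' % i, 'w') as out:
        for record in records[i::3]:
            out.write(record + '\n')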
These files have been uploaded to S3, under the following buckets:
s3://chipotle-crisis (located in the US)
s3://chipotle-crisis-sg (located in Singapore)

Currently, all three files (frozen versions) are also available on S3 at s3://chipotle-crisis-final/step_0_results.
Initialise a Spark cluster on Amazon EMR with the following software configuration:
- emr-4.7.1
- Spark 1.6.1
- Hive 1.0.0
- Hadoop 2.7.2
- Hue 3.7.1
- Zeppelin-Sandbox 0.5.6
- Pig 0.14.0
The hardware configuration is as follows:
- Master - 1x m4.xlarge instance
- Core - 4x m4.xlarge instances
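If you prefer to launch the cluster from code rather than the EMR console, here is a hedged boto3 sketch; the cluster name, key pair, and IAM roles below are assumptions, and the region should be ap-southeast-1 for the Singapore cluster:

# Sketch of launching the EMR cluster with boto3. Substitute your own
# name, key pair, roles, and region; these values are placeholders.
import boto3

emr = boto3.client('emr', region_name='us-east-1')
response = emr.run_job_flow(
    Name='chipotle-crisis',              # assumed cluster name
    ReleaseLabel='emr-4.7.1',
    Applications=[{'Name': a} for a in
                  ['Spark', 'Hive', 'Hadoop', 'Hue', 'Zeppelin-Sandbox', 'Pig']],
    Instances={
        'InstanceGroups': [
            {'InstanceRole': 'MASTER', 'InstanceType': 'm4.xlarge',
             'InstanceCount': 1},
            {'InstanceRole': 'CORE', 'InstanceType': 'm4.xlarge',
             'InstanceCount': 4},
        ],
        'Ec2KeyName': 'my-key-pair',     # assumed key pair
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])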
After the cluster has started up, we need to copy the JSON files to HDFS for Spark to consume. Do the following (US version):
# Remember to use screen
hadoop fs -mkdir tweets
hdfs dfs -cp s3://chipotle-crisis/tweets_0.json tweets/tweet_0.json
hdfs dfs -cp s3://chipotle-crisis/tweets_1.json tweets/tweet_1.json
hdfs dfs -cp s3://chipotle-crisis/tweets_2.json tweets/tweet_2.json
hadoop fs -ls tweets
For a Spark cluster based in Singapore, use the following:
# Remember to use screen
hadoop fs -mkdir tweets
hdfs dfs -cp s3://chipotle-crisis-sg/tweets_0.json tweets/tweet_0.json
hdfs dfs -cp s3://chipotle-crisis-sg/tweets_1.json tweets/tweet_1.json
hdfs dfs -cp s3://chipotle-crisis-sg/tweets_2.json tweets/tweet_2.json
hadoop fs -ls tweets
Download the Python code from GitHub:
wget https://github.com/chuajiesheng/twitter-sentiment-analysis/archive/4483cecf8d9663a21bf3a1db7f2bb9f019ad4c4e.zip
unzip 4483cecf8d9663a21bf3a1db7f2bb9f019ad4c4e.zip
Then start up Spark and run the sampling script.
Note: I have realised that rdd.takeSample returns different samples on different machines.
# Note the hash
cd twitter-sentiment-analysis-4483cecf8d9663a21bf3a1db7f2bb9f019ad4c4e
# run Spark
pyspark
execfile('step_1/scripts/instructional_sampling.py')
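To see the takeSample caveat for yourself, you can sample a few lines interactively inside the pyspark shell; even with a fixed seed, the result can vary across clusters because it also depends on how the data happens to be partitioned:

# Run inside the pyspark shell, where `sc` is already defined.
# takeSample(withReplacement, num, seed) is the Spark 1.6 signature;
# a fixed seed alone does not guarantee identical samples across
# clusters, since the outcome depends on data partitioning too.
rdd = sc.textFile('tweets')
sample = rdd.takeSample(False, 5, seed=42)
for line in sample:
    print(line[:80])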
In the current working directory, which is twitter-sentiment-analysis-4483cecf8d9663a21bf3a1db7f2bb9f019ad4c4e, you will find the following files:
Sample tweets (for demo purposes):
sample_posts.csv
sample_posts.json
Development tweets (3,000 tweets):
dev_posts.csv
dev_posts.json
Kappa tweets (300 tweets sampled from the development tweets):
kappa_posts.csv
kappa_posts.json
Currently, all six files (frozen versions) have been uploaded to S3 at s3://chipotle-crisis-final/step_1_results.
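As a quick sanity check, you can load one of the frozen files back into the pyspark shell; the exact file name under step_1_results is assumed to match the listing above:

# Inside the pyspark shell; `sqlContext` is provided in Spark 1.6.
df = sqlContext.read.json('s3://chipotle-crisis-final/step_1_results/dev_posts.json')
df.printSchema()
print(df.count())  # expect 3,000 development tweets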
Handy commands for moving files between S3, the local filesystem, and HDFS:

# Download a file from S3 to the local filesystem
aws s3 cp s3://<filepath> .
# Upload a local file to HDFS
hadoop fs -put <filepath> .
# Print a (possibly compressed) HDFS file as text
hdfs dfs -text <filepath>
# Decompress a gzipped HDFS file and write the result back to HDFS
hadoop fs -cat <filepath> | gzip -d | hadoop fs -put - <output>