tttt

Technology Trends Through Tweets: processing tweets with Hadoop, Fabric, and matplotlib (a CS 736 course project).

Configuration

Install

  1. Install a virtual environment (activate it before installing packages, so they land in the venv)

     virtualenv ~/venv
     source ~/venv/bin/activate
     pip install tweepy fabric matplotlib
    
  2. Install Hadoop (this assumes hadoop-2.5.1.tar.gz has already been downloaded to ~)

      cd ~
      tar zxvf hadoop-2.5.1.tar.gz
      mv hadoop-2.5.1 hadoop
      git clone TTTTREPO
      # copy the repo's Hadoop configs into the Hadoop tree (destination assumed)
      cp -r TTTTREPO/etc hadoop/
    
  3. Install dstat

      cd ~
      wget -c http://dag.wieers.com/home-made/dstat/dstat-0.7.2.tar.bz2
      tar jxvf dstat-0.7.2.tar.bz2
      # copy the dstat script out of the extracted directory (not the tarball)
      cp dstat-0.7.2/dstat ~
    
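A quick sanity check that the standalone dstat script runs (these are standard dstat flags: CPU, disk, network, memory, sampled at a 1-second interval, 5 times):

    cd ~
    ./dstat -cdnm 1 5
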

Quick alias

vim ~/.bash_aliases

# change the paths to point at your actual Hadoop installation
alias hadoop='/tmp/dfs/hadoop/bin/hadoop'
alias hdfs='/tmp/dfs/hadoop/bin/hdfs'
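
Reload the aliases into your current shell after editing (or just open a new terminal):

    source ~/.bash_aliases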

Environment

vim ~/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java
# you could also try
# export JAVA_HOME=/usr/lib/jvm/java-1.7.0

# add these environment variables below JAVA_HOME
export PATH=$JAVA_HOME/bin:$PATH
# lets you call the JDK's Java compiler through `hadoop` to compile MapReduce sources
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar

Single-node Hadoop mode

cd path/to/WordCount
cp YOURINPUTFILE input/
rm -rf output
make
hadoop jar wc.jar WordCount input/ output
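
For reference, `make` here presumably just compiles and packages the job. A minimal sketch of the equivalent manual steps, assuming the standard WordCount.java from the Hadoop MapReduce tutorial (not necessarily the repo's actual Makefile):

    # compile via hadoop; this is why HADOOP_CLASSPATH must point at tools.jar
    hadoop com.sun.tools.javac.Main WordCount.java
    # package the compiled classes into the jar the run command expects
    jar cf wc.jar WordCount*.class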

Cluster mode

  1. Modify the hadoop/etc/hadoop/slaves file so that it lists your slave hostnames, e.g.

         macaroni-01.cs.wisc.edu
         macaroni-02.cs.wisc.edu
         macaroni-03.cs.wisc.edu
    
  2. Upload your input into HDFS and run the job

      hdfs dfs -mkdir -p /cs736/input
      hdfs dfs -put YOURINPUTFILE /cs736/input/
      hdfs dfs -ls /cs736/input/
      hadoop jar wc.jar WordCount /cs736/input/ /cs736/output
      hdfs dfs -ls /cs736/output/
    
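When the job finishes, the results are written to the output directory. Assuming the default reducer output naming, you can inspect them with:

    hdfs dfs -cat /cs736/output/part-r-00000
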

Check cluster status

Using fabric

  1. Configure the master and slaves

     cp config.py.sample config.py
     vim config.py
     # edit MACHINES to list your masters and slaves
    
  2. fab init
  3. fab start
  4. hdfs dfs -ls / (verify that HDFS is up)
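
To confirm the daemons actually came up, jps (which ships with the JDK) is handy; the daemon names below are the standard Hadoop 2.x ones:

    # on the master expect NameNode / ResourceManager,
    # on each slave expect DataNode / NodeManager
    jps
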
Stop hadoop

  • fab stop

Restart hadoop

  • fab restart

Manually operate dstat

  • fab start_dstat: start dstat; it runs for 3600 seconds (configurable)
  • fab stop_dstat

Copy new configurations

  • fab copy
Add nodes

Assume macaroni-05 is the master and macaroni-01 is currently the only slave; we want to add macaroni-02 and macaroni-03.

  1. fab stop: stop all nodes.
  2. Modify your config.py so that only the new machines are active, like:

      MACHINES = {
          # machines that are already initialized stay commented out
          #'master': ['macaroni-05'],
          'master': [],
          'slave': [
              #'macaroni-01',
              'macaroni-02',
              'macaroni-03',
          ],
      }
    
  3. fab init (attention: never reinitialize your master!)

  4. Uncomment the machines in config.py

  5. fab start

     • Fabric will then prompt with something like No hosts found. Please specify (single) host string for connection:. Just type in an arbitrary new slave hostname, in this case macaroni-03. (It is a hack.)

The hadoop/etc/hadoop/slaves file appears to be irrelevant for specifying slave nodes when fabric is used; however, you do have to modify it if you are not using fabric.

Known Bugs

  • The NodeManager on each slave must be started manually
  • To leave safe mode: hdfs dfsadmin -safemode leave
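
A likely workaround for the first item, assuming the stock Hadoop 2.x sbin scripts under ~/hadoop (run on each affected slave):

    # yarn-daemon.sh ships with Hadoop 2.x and starts a single YARN daemon
    ~/hadoop/sbin/yarn-daemon.sh start nodemanager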
