Controversy Detection in Twitter Stream

Usage:

Setting up environment

create a virtual environment: virtualenv -p python3 venv
activate it: source venv/bin/activate
install software dependencies: pip install -r requirements.txt
start the Notebook server: ipython notebook

Running experiments

All experiments are written as Notebook scripts (files with the ".ipynb" extension).

Section 5.2: dynamic_graph_partitioning.ipynb
Section 5.3: simulation.ipynb

Done

controversy score calculation: controversy_score.ipynb
dynamic graph partitioning: dynamic_graph_partitioning.ipynb
Incremental controversy score update and evaluation
- when new edges are added or old edges are removed,
- incrementally partition the graph and incrementally update the controversy score
- compare the score and running time with calculating the score from scratch
The skeleton code for simulation on twitter stream
Comparing throughput of the incremental (IC) approach and the from-scratch (FS) approach
Periodic update improvement: track last_updated_time for each hashtag
Comparing controversy score of IC approach and FS approach
Integrate top-k hashtag selection algorithm
Fixed RWC computation error: 1) works on largest CC 2) thresholding RWC computation by largest CC size (not graph size)
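
The fix in the last item can be sketched as follows. This is a minimal illustration, not the code in rwc.py: the adjacency-dict graph representation, the function names largest_connected_component and should_compute_rwc, and the default threshold min_cc_size=100 are all assumptions made for the example.

```python
from collections import deque


def largest_connected_component(adj):
    """Return the node set of the largest connected component.

    `adj` maps every node to the set of its neighbours (undirected graph).
    """
    seen = set()
    best = set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def should_compute_rwc(adj, min_cc_size=100):
    """Threshold on the size of the largest CC, not on the whole graph.

    min_cc_size is a hypothetical value chosen for illustration.
    """
    return len(largest_connected_component(adj)) >= min_cc_size
```

With this check, a graph made of many small scattered components is skipped even when its total node count is large.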

Todo

These are must-do:

Manually check some top controversial events

These are optional:

Incremental graph partitioning:
- add node skipping in incremental graph partitioning
- evaluate the tradeoff between cut objective and computation time
- optimize the graph partitioning code (profiling even cython)
Find a way to summarize a controversial hashtag. For example, what are the typical opinions/tweets of the opposing sides/communities?
- one goal is: by checking the summarization, one can decide whether this hashtag is controversial or not.
Label ground truth on which tags are controversial (so that we can have precision/recall/F1)?

Data preparation

All retweets from 2016 July.

Use twitter_stream_data.py to extract the retweets from raw data.
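
The extraction step might look roughly like the sketch below. It is not the content of twitter_stream_data.py. It assumes the raw data holds one JSON-encoded tweet per line in the Twitter API v1.1 layout, where a retweet carries a retweeted_status object and hashtags sit under entities.hashtags. The function name extract_retweets is hypothetical.

```python
import json


def extract_retweets(lines):
    """Yield (retweeter, original_author, hashtags, created_at) per retweet.

    Assumes one JSON tweet per line (Twitter API v1.1 layout).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if not isinstance(tweet, dict):
            continue  # skip anything that is not a tweet object
        original = tweet.get("retweeted_status")
        if not original:
            continue  # not a retweet
        hashtags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]
        yield (
            tweet["user"]["screen_name"],
            original["user"]["screen_name"],
            hashtags,
            tweet.get("created_at"),
        )
```

Each yielded tuple corresponds to one edge (retweeter, original author) in the retweet graph of every hashtag listed.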

Controversy score checking

beefban: 2e4, 0.16
baltimore: 9e4, 0.17
ukraine: 5e4, 0.12
circular: 5e4, 0.0014
star: 5e4, 0.0014
Barabasi: 5e4, 0.0056
ALDUB1stAnniversary: 0.04 (non-controversial)

Normalizing RWC to [0.5, 1.0]

This is done by controversy / (controversy + non_controversy).
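
A minimal sketch of this normalization follows. It assumes controversy and non_controversy are non-negative numbers; the ratio then falls in [0.5, 1.0] whenever controversy >= non_controversy. The function name normalize_rwc and the handling of the all-zero case are assumptions, not taken from the notebooks.

```python
def normalize_rwc(controversy, non_controversy):
    """Return controversy / (controversy + non_controversy).

    The result lies in [0.5, 1.0] when controversy >= non_controversy >= 0.
    """
    total = controversy + non_controversy
    if total == 0:
        # Assumption: with no evidence either way, report the neutral value.
        return 0.5
    return controversy / total
```

For example, normalize_rwc(3.0, 1.0) gives 0.75, and equal inputs give 0.5.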

And some comparison on the example networks:

barbell: 0.9942951520693315
beefban: 0.8838806129369029
ukraine: 0.8791934195724288
circular: 0.8596169409366637 (this is very unexpected!)
baltimore: 0.8172103276545415
barabasi: 0.5599275837061325
star: 0.5003170710418742

Interesting tags

MTVHottest: 0.12, 4 clusters (seems to be controversial)
PokemonGO: seems to be non-controversial but receives a score of 0.13; the retweet graph looks like two stars connected by one edge.

RWC evolution

See figs/{dataset}-volume.png and figs/{dataset}-rwc-vs-time.png.

For beefban, ukraine:

The RWC score is high at the beginning.
For beefban, the controversy score seems to track the temporal volume, with some minor differences in trend.
For ukraine, this relationship is less apparent.

For MTVHottest, the volume and the RWC score do not match in shape: the RWC score increases over time, while the volume curve is irregular.

Installing metis

Install metis wrapper
Install Metis
Important: make config static=1

Dynamic graph partitioning

Check out dynamic_graph_partitioning.ipynb for the code and evaluation.

TODO:

node skipping is not implemented yet
the pure Python implementation is slower than metis, which is written in C++.

Incremental controversy score update

Check out incremental_controversy_score.ipynb.

Some quick results (add/remove 10% of the edges):

average running time reduction is 18%
the Pearson correlation coefficient of RWC scores is: 0.994445283425 with p-value 4.26692170169e-07.
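
For reference, the correlation between the two sets of scores can be computed as in the sketch below. It is a plain-Python version of the Pearson coefficient only (no p-value); the function name and the sample numbers are placeholders for illustration, not values from the experiment.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for illustration only (not experimental results).
incremental_scores = [0.10, 0.20, 0.30, 0.40]
from_scratch_scores = [0.20, 0.40, 0.60, 0.80]
r = pearson_correlation(incremental_scores, from_scratch_scores)
```

Here each position in the two lists would hold the RWC score of the same graph, computed once incrementally and once from scratch.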

TODO:

what if fewer edges are added/removed?

Throughput test

Refer to simulation.ipynb.

1e5 retweets, 60 mins time window, 5 mins update interval

Incremental: 216 seconds
From scratch: 422 seconds
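
The window and update-interval settings above can be illustrated with the following sketch. It is not the code in simulation.py: the TimeWindow class, the run_stream function, and the (timestamp, edge) input format are assumptions, and the actual score computation is left out.

```python
from collections import deque


class TimeWindow:
    """Keep only the items whose timestamp lies within the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.items = deque()  # (timestamp, item) pairs in arrival order

    def add(self, timestamp, item):
        """Append a new item and return the items that expired as a result."""
        self.items.append((timestamp, item))
        expired = []
        while self.items and self.items[0][0] <= timestamp - self.window:
            expired.append(self.items.popleft()[1])
        return expired


def run_stream(retweets, window=60 * 60, update_interval=5 * 60):
    """Yield (time, edges_in_window) each time an update is due.

    `retweets` is an iterable of (timestamp_seconds, (retweeter, author)),
    assumed to be sorted by timestamp.
    """
    win = TimeWindow(window)
    last_update = None
    for timestamp, edge in retweets:
        win.add(timestamp, edge)
        if last_update is None:
            last_update = timestamp
        if timestamp - last_update >= update_interval:
            last_update = timestamp
            yield timestamp, [e for _, e in win.items]
```

In an incremental setup, the list returned by add (the expired edges) and the newly added edge would be the inputs to the incremental update, whereas a from-scratch run would rebuild everything from the edges currently in the window.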

Correlation of RWCs between IC approach and FS approach

Averaged over RWC scores over multiple updates on multiple graphs.

Average: 0.9, not as high as the previous result (0.99) averaged over multiple graphs but only one update each.

My guess: incremental graph partitioning accumulates errors over successive updates.

Plotted the RWC score evolution for #VeranoMTV2016.

Check simulation.ipynb

Stream volume graph of top events

Check out simulation.ipynb.

Evaluating hashtags

Check manually_check_hashtag.ipynb

Issues (and solutions)

The following issues were encountered; some are solved, while others remain open:

how to apply Kiran's method on top of this dataset, especially when there are many disconnected components?
- pagerank can deal with that
issue on the controversy score definition:
- the number of high-degree nodes should be proportional to the network size: the above experiment chose 1e-3
- there is no explicit bound on the score
- solution: take k percent of the nodes and use division to bound the score.
hashtag may contain opinion bias such as NoDAPL (protest against oil pipeline)
At the beginning, a controversial hashtag may induce many disconnected components; how to deal with this?
- A more fundamental question is, how does the graph on controversy-hashtag evolve?
- I take the largest CC and compute RWC based on it.
How to make RWC more robust?
- varying number of partitionings
- scattered CCs
- the largest CC size should be big enough
- for small graphs, like a retweet tree, this method does not work very well.
Computing largest CC
- currently computed from scratch
- should make it incremental
RWC does not perform that well
- maybe minimum RWC score needs to be tuned. That's a pain.
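
For the "Computing largest CC" item, one possible direction is a union-find (disjoint-set) structure that tracks component sizes as edges arrive. The sketch below is only a suggestion, not code from this repository: the class and method names are hypothetical, and it covers edge insertions only. Edge removals, which the sliding time window also produces, are not handled and would still need a rebuild or a different data structure.

```python
class ComponentTracker:
    """Track the size of the largest connected component under edge insertions."""

    def __init__(self):
        self.parent = {}
        self.size = {}
        self.largest = 0

    def find(self, node):
        """Return the representative of the component containing `node`."""
        if node not in self.parent:
            # First time we see this node: it forms its own component.
            self.parent[node] = node
            self.size[node] = 1
            self.largest = max(self.largest, 1)
        # Path halving keeps the trees shallow.
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def add_edge(self, u, v):
        """Merge the components of u and v; return the largest component size."""
        root_u, root_v = self.find(u), self.find(v)
        if root_u == root_v:
            return self.largest
        # Union by size: attach the smaller tree under the larger one.
        if self.size[root_u] < self.size[root_v]:
            root_u, root_v = root_v, root_u
        self.parent[root_v] = root_u
        self.size[root_u] += self.size[root_v]
        self.largest = max(self.largest, self.size[root_u])
        return self.largest
```

The tracker reports only the size of the largest component; recovering its node set would require an extra pass over the nodes that share the same representative.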

Sandy4321/controversy_detection
