Cross-Domain Learning for Classifying Propaganda in Online Contents

This is the data repository for our propaganda research at the Max Planck Institute for Informatics. Our work investigated the cross-domain learning for propaganda detection by several methods including one proposed pair-wise ranking model. This repository mainly consists of research data as well as the code of the 'LSTMR' model.

Data

The data comprises the following directories and files: All these data are organized by the 'tsv' format (fields are separated by '\t').

News

The news dataset comes from the “Hack the News Datathon”(Please refer: https://www.datasciencesociety.net/hacknews-datathon). We combined and reorganized different datasets from this source, to construct an article-level news corpus (news-a.tsv) and a sentence-level corpus (news-s.tsv). Each of these files contains the following attributes:

SOURCE: all are from News.
LABEL: Propaganda(1) or not (0).
TEXT: The text of an article or a sentence.

Speech

We collected transcripts of speeches from four politicians. Trump and Obama are considered as contemporary speakers, largely talking about the same or related topics. We consider the former as more propagandistic than the latter. We use Joseph Goebbels (the Nazis’ Minister of Propaganda) and Winston Churchill (the Prime Minister of the UK)as prominent figures from the World War II era, with the former being more propagandistic than the latter. The data is organized at two different levels of granularity: articles (speech-a.tsv) and sentences (speech-s.tsv).

SOURCE: The speakers.
LABEL: Propaganda(1) or not (0).
TEXT: The text of an article or a sentence.

tweets

We combine two pre-existing collections of tweets to construct this dataset(tweets.tsv). We consider the Twitter IRA corpus (Edgett, 2017) as propagandistic, and the 'twitter7' data in SNAP (Yang and Leskovec, 2011) as ordinary.

SOURCE: SNAP or IRA.
LABEL: Propaganda(1) or not (0).
TEXT: The text of the tweets.

Code

The code includes the proposed LSTMR model (cross-lstm-rank.py) and the corresponding data preprocessing procedure (other files in this folder). All these codes are based on Python 3.6. The deep learning method is based on Pytorch 1.6. And some other packages are also necessary like sklearn, pandas, NLTK.

Data Privacy

All data was crawled or collected from public resources in compliance with legal regulations. We release the data in its original form, without any masking or anonymization. Researchers who use the data for their own projects are responsible for observing privacy rules (e.g., when combining this data with additional data sources) and other compliance regulations.

References

If you use this dataset, please cite the following paper:

@inproceedings{wang2020crossdomain,
  title={Cross-Domain Learning for Classifying Propaganda in Online Contents},
  author={Wang, Liqiang and Shen, Xiaoyu and de Melo, Gerard and Weikum, Gerhard},
  booktitle={Conference for Truth and Trust Online 2020 (TTO)},
  year={2020}
}

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
code		code
data		data
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

code

code

data

data

README.md

README.md

Repository files navigation

Cross-Domain Learning for Classifying Propaganda in Online Contents

Data

News

Speech

tweets

Code

Data Privacy

References

About

Releases

Packages

Languages

leereak/propaganda-detection

Folders and files

Latest commit

History

Repository files navigation

Cross-Domain Learning for Classifying Propaganda in Online Contents

Data

News

Speech

tweets

Code

Data Privacy

References

About

Resources

Stars

Watchers

Forks

Languages