Clustering-Based Article Identification in Historical Newspapers

Source code and text collection for the paper "Clustering-Based Article Identification in Historical Newspapers", published at the LaTeCH-CLfL 2019 workshop.

Table of Contents

  • Clustering of Texts
  • Automatic Text Segmentation
  • Dataset
  • Embeddings
  • Replicate results from the paper
  • Citation
  • License

Clustering of Texts

Before clustering, the dataset needs to be downloaded (see the Dataset section below) and the segment boundaries need to be available. The clustering can then be performed with the following script:

python execute_clustering_gold_standard_arg.py dataset/corpus_txt dataset/annotations/ -esc -pd -rs 1 2 3 4 5 

The script accepts many parameters, which are listed by the help command:

python execute_clustering_gold_standard_arg.py --help
usage: execute_clustering_gold_standard_arg.py [-h] [-e100 EMBEDDINGS100]
                                               [-e200 EMBEDDINGS200]
                                               [-mo MIN_OCR]
                                               [-aaf AUTOMATIC_ANNOTATION_FOLDER]
                                               [-sc] [-esc] [-nc NC [NC ...]]
                                               [-n NGRAM [NGRAM ...]] [-pf]
                                               [-pd] [-pa] [-jws]
                                               [-rs RS [RS ...]]
                                               document_folder
                                               annotation_folder

Execute the clustering and segmentation and the evaluation

positional arguments:
  document_folder       folder for the text document
  annotation_folder     folder for the annotations of the text document

optional arguments:
  -h, --help            show this help message and exit
  -e100 EMBEDDINGS100, --embeddings100 EMBEDDINGS100
                        binary for the 100 dimensional fastText embeddings. If
                        not active, they will not be used.
  -e200 EMBEDDINGS200, --embeddings200 EMBEDDINGS200
                        binary for the 200 dimensional fastText embeddings. If
                        not active, they will not be used.
  -mo MIN_OCR, --min-ocr MIN_OCR
                        minimum OCR score (default: -100.0)
  -aaf AUTOMATIC_ANNOTATION_FOLDER, --automatic_annotation_folder AUTOMATIC_ANNOTATION_FOLDER
  -sc, --spectral_clustering
                        use standard spectral clustering
  -esc, --exponential_spectral_clustering
                        use exponential spectral clustering
  -nc NC [NC ...], --number_of_cluster NC [NC ...]
                        Specify the number of clusters to be used (can be a
                        list of numbers) [1-15,20,30,40,50,60,100]
  -n NGRAM [NGRAM ...], --ngram NGRAM [NGRAM ...]
                        specify the N for the n-grams that are extracted
                        (default: 3)
  -pf, --process_file   Process each file individually
  -pd, --process_day    Process files day-wise
  -pa, --process_all    Process all files
  -jws, --jaccard_word_sim
                        apply Jaccard Word similarity
  -rs RS [RS ...], --random_states RS [RS ...]
                        Using this option, the clustering will be performed as
                        often as seeds are provided. If none is given, a time-
                        based random seed is used.
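
For reference, the Jaccard word similarity enabled by -jws is the standard set-overlap measure between the word sets of two segments. A minimal sketch (not necessarily the exact implementation used in this repository):

def jaccard_word_sim(segment_a, segment_b):
    # Jaccard similarity: |A n B| / |A u B| over the two word sets
    a, b = set(segment_a.split()), set(segment_b.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

The -sc and -esc options run spectral clustering over a pairwise segment-similarity matrix; based on the flag name, we assume here that the exponential variant applies exp() to the similarities before clustering. A sketch using scikit-learn:

import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_segments(similarity, n_clusters, random_state, exponential=True):
    # similarity: symmetric (n_segments x n_segments) matrix, e.g. built
    # from jaccard_word_sim or cosine similarities of segment embeddings
    affinity = np.exp(similarity) if exponential else similarity
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity="precomputed",
                               random_state=random_state)
    return model.fit_predict(affinity)  # one cluster label per segment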

Automatic Text Segmentation

The automatic segmentation using TextTiling can be performed with the following command, which expects an input directory of files as parameter and writes the segmented documents to output_directory:

python texttiling_app.py input_directory output_directory
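
texttiling_app.py itself is not reproduced here; the following is a minimal sketch of such directory-level TextTiling, assuming NLTK's TextTilingTokenizer and input documents that contain paragraph breaks ("\n\n"), which the tokenizer requires:

import os
import sys
from nltk.tokenize import TextTilingTokenizer

def segment_directory(input_dir, output_dir):
    tokenizer = TextTilingTokenizer()
    os.makedirs(output_dir, exist_ok=True)
    for name in os.listdir(input_dir):
        with open(os.path.join(input_dir, name), encoding="utf-8") as f:
            text = f.read()
        segments = tokenizer.tokenize(text)  # list of segment strings
        with open(os.path.join(output_dir, name), "w", encoding="utf-8") as f:
            # one marker line between segments; the actual output format
            # of texttiling_app.py may differ
            f.write("\n==========\n".join(segments))

if __name__ == "__main__":
    segment_directory(sys.argv[1], sys.argv[2])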

In order to evaluate the segmentation, the following script can be used:

python texttiling_eval.py input_directory annotation_dir min_ocr

The input_directory contains the text documents, annotation_dir the annotated segment boundaries, and min_ocr the minimal OCR score that a file needs to fulfill in order to be considered for the evaluation.
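
The metrics computed by texttiling_eval.py are not documented here; Pk and WindowDiff, the standard boundary-evaluation metrics (both available in NLTK), can be computed as follows:

from nltk.metrics.segmentation import pk, windowdiff

# Boundaries encoded as strings: '1' marks a segment boundary, '0' none
# (hypothetical example values).
reference  = "0100100100"   # gold boundaries
hypothesis = "0100010010"   # predicted boundaries
print(pk(reference, hypothesis))             # lower is better
print(windowdiff(reference, hypothesis, 3))  # window size k=3, lower is better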

Dataset

We annotated pages from the March 1912 issues of the New York Tribune Sunday magazine, marking both advertisements and articles. We provide the annotations with this repository; the raw texts of the dataset, however, have to be downloaded using the script:

sh scripts/download_newspaper.sh 

Then, all data for the task are located in the folder "dataset". It contains the following folders and files (a short sketch for inspecting content.csv follows the list):

  • annotations: the annotations for each page
  • corpus_txt: all text files of the pages, OCRed from the PDFs
  • corpus_pdf: all PDFs of the pages
  • content.csv: a file listing all articles, including their title and author (if available)
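
The exact column layout of content.csv is not documented here; the following sketch merely inspects the file:

import csv

with open("dataset/content.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows[0])           # header row (actual column names may differ)
print(len(rows) - 1, "articles listed")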

Embeddings

fastText embeddings trained on English text from the year 1912 can be downloaded using the following script:

sh scripts/download_embeddings.sh 
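
The downloaded .bin files are regular fastText binaries; as a sketch, they can be loaded with the fasttext Python package (gensim's load_facebook_vectors works as well):

import fasttext

# Path as created by scripts/download_embeddings.sh and used with -e200 below
model = fasttext.load_model("embeddings/year1912.clean.txt.fasttext.200.bin")
vector = model.get_word_vector("newspaper")  # 200-dimensional numpy array
print(vector.shape)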

Replicate results from the paper

First, you need to download the dataset and the embeddings:

sh scripts/download_newspaper.sh
sh scripts/download_embeddings.sh 

Results with gold segments

By Issue

python execute_clustering_gold_standard_arg.py dataset/corpus_txt dataset/annotations/ -esc -pd  -n 2 3 4 5 6 7 8 -rs 1 2 3 4 5 -jws -e200 embeddings/year1912.clean.txt.fasttext.200.bin -nc 10 11 12 13 14 15 

All Issues

python execute_clustering_gold_standard_arg.py dataset/corpus_txt dataset/annotations/ -esc -pa  -n 2 3 4 5 6 7 8 -rs 1 2 3 4 5 -jws -e200 embeddings/year1912.clean.txt.fasttext.200.bin -nc 50 51 52 53 54 55 

Results with automatic segments

First, we perform TextTiling:

python texttiling_app.py dataset/corpus_txt dataset/corpus_txt_texttiling

By Issue

python execute_clustering_gold_standard_arg.py dataset/corpus_txt dataset/annotations/ -esc -pd  -n 2 3 4 5 6 7 8 -rs 1 2 3 4 5 -jws -e200 embeddings/year1912.clean.txt.fasttext.200.bin -nc 10 11 12 13 14 15 -aaf dataset/corpus_txt_texttiling

All Issues

python execute_clustering_gold_standard_arg.py dataset/corpus_txt dataset/annotations/ -esc -pa  -n 2 3 4 5 6 7 8 -rs 1 2 3 4 5 -jws -e200 embeddings/year1912.clean.txt.fasttext.200.bin -nc 50 51 52 53 54 55 -aaf dataset/corpus_txt_texttiling

Citation

@inproceedings{riedl19:historic_newspaper,
  title = {Clustering-Based Article Identification in Historical Newspapers},
  author = {Riedl, Martin and Betz, Daniela and Padó, Sebastian},
  booktitle = {Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature},
  series = {LaTeCH-CLfL 2019},
  address = {Minneapolis, USA},
  note = {To appear},
  year = 2019
}

License

This project is licensed under the terms of the Apache License 2.0. If used for research, a citation would be appreciated. The annotation data is published under the permissive CC BY license.
