This project aims to extract discriminative terms over spatial and temporal windows. Experiments are conducted on COVID-19 using the corpus of tweets created by Emily Chen: https://github.com/echen102/COVID-19-TweetIDs
Tweets have to be downloaded (i.e. hydrated) from the echen repository and indexed into an Elasticsearch index. See the steps below:
- Clone the echen repository (https://github.com/echen102/COVID-19-TweetIDs)
- Install twarc with pip and configure it with a Twitter account. See: https://github.com/DocNow/twarc
- Launch the echen hydrate script
- Copy all hydrated tweets. They are gzipped: find . -name '*.jsonl.gz' -exec cp -prv '{}' 'hydrating-and-extracting' ';'
- Unzip all .jsonl.gz files:
gunzip hydrating-and-extracting/coronavirus-tweet
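The copy-and-decompress steps above can also be done in one pass from Python. A minimal sketch (the function name and directory layout are illustrative, not part of the repository):

```python
import gzip
import shutil
from pathlib import Path

def decompress_hydrated(src_dir, dest_dir):
    """Find every hydrated *.jsonl.gz under src_dir and write the
    decompressed .jsonl files into dest_dir (created if needed)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in Path(src_dir).rglob("*.jsonl.gz"):
        target = dest / archive.name[:-3]  # strip the trailing ".gz"
        with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream, so large files fit in memory
        written.append(target)
    return written
```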
- Index into Elasticsearch:
    - Installation of ELK and plugins:
        - Install ELK: Logstash, Elasticsearch and Kibana
        - Install a Logstash plugin to geocode user locations (the plugin queries a REST API):
sudo /usr/share/logstash/bin/logstash-plugin install logstash-filter-rest
    - Start indexing into Elasticsearch with Logstash:
sudo /usr/share/logstash/bin/logstash -f elasticsearch/logstash-config/json.conf
- /!\ Be careful if you index from a laptop over Wi-Fi: the wlan interface may power off even if you disable sleep mode. On a Debian/Ubuntu OS, you'll need to disable power management on your wlan interface:
sudo iwconfig wlo1 power off
(not persistent across reboots)
- (OPTIONAL) Kibana: you can import the dashboard
The following script allows you to:
- Build a Hierarchical TF-IDF, called H-TFIDF, over space and time
- Build a classical TF-IDF to compare with
- Encode the terms extracted by both measures to compute semantic similarity:
COVID-19-TweetIDS-ES-Analyse.py
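The actual H-TFIDF implementation lives in COVID-19-TweetIDS-ES-Analyse.py and queries Elasticsearch; the sketch below only illustrates the underlying idea under a simplifying assumption: tweets are aggregated per (region, week) window, each window becomes one pseudo-document, and TF-IDF is computed over those pseudo-documents, so a term scores highly when it is frequent in one spatio-temporal window but rare in the others.

```python
import math
from collections import Counter, defaultdict

def h_tfidf(tweets, top_k=3):
    """Toy H-TFIDF: tweets is an iterable of (region, week, text) triples.
    Returns, for each (region, week) window, its top_k discriminative terms."""
    # 1. Aggregate tweets into one pseudo-document per spatio-temporal window.
    docs = defaultdict(list)
    for region, week, text in tweets:
        docs[(region, week)].extend(text.lower().split())
    # 2. Document frequency: in how many windows does each term appear?
    n_docs = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    # 3. TF-IDF per window, then keep the best-ranked terms.
    ranked = {}
    for window, terms in docs.items():
        tf = Counter(terms)
        total = len(terms)
        scores = {t: (tf[t] / total) * math.log(n_docs / df[t]) for t in tf}
        ranked[window] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return ranked
```

A term like "covid" that occurs in every window gets an IDF of zero, while a term specific to one region-week rises to the top of that window's list.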
More experiments and methods for evaluating H-TFIDF against a classical TF-IDF can be found in the scripts, with explanations.
In order to explore the dataset without using Elasticsearch (except for one of them), here are some scripts that give first results:
- Dataset Analysis: some statistics computed on the echen corpus
- Extractor: a script extracting only tweet contents (without RTs), in order to share data without the full verbose Twitter API payload
- And a pipeline for term extraction using biotex:
    - preprocess: cleans up tweets and builds the corpus in the biotex syntax
    - biotex-wrapper: automates biotex runs over 4 settings
    - merge biotex results: due to the size of this corpus, biotex could not be launched on the full corpus; it has to be split into 30k-tweet chunks, and the results then have to be merged and re-ranked
- Other functions to explore, but which use Elasticsearch
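The split-then-merge step of the biotex pipeline can be sketched as follows. This is a minimal illustration, not the repository's merge script: the 30k chunk size comes from the text above, but the merge strategy (summing per-chunk scores before re-ranking) is one plausible choice among several.

```python
from collections import defaultdict

CHUNK_SIZE = 30_000  # biotex cannot be run on the full corpus in one pass

def chunk(tweets, size=CHUNK_SIZE):
    """Split the tweet list into slices small enough for one biotex run."""
    for i in range(0, len(tweets), size):
        yield tweets[i:i + size]

def merge_rankings(per_chunk_scores):
    """Merge the term->score dicts produced on each chunk by summing the
    per-chunk scores, then re-rank globally (highest total first)."""
    totals = defaultdict(float)
    for scores in per_chunk_scores:
        for term, score in scores.items():
            totals[term] += score
    return sorted(totals, key=totals.get, reverse=True)
```

Each chunk would be preprocessed into the biotex input syntax, ranked by biotex, and the resulting per-chunk term scores fed to merge_rankings.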
This is based upon the works of:
- Juan Antonio LOSSIO-VENTURA, creator of BioTex
- Jacques Fize, who built a Python wrapper for BioTex (see his repository for more details)
- Gaurav Shrivastava, who coded the FASTR algorithm in Python. His script is in this repository
This code is provided under the CeCILL-B free software license agreement.
By using E. Chen's dataset, as stated by the author, you agree to abide by the stipulations in the license, remain in compliance with Twitter's Terms of Service, and cite the following manuscript:
Chen E, Lerman K, Ferrara E. Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health Surveill 2020;6(2):e19273. DOI: 10.2196/19273. PMID: 32427106