This project aims to extract discriminative terms over spatial and temporal windows. Experiments are conducted on COVID-19 using the corpus of tweets created by Emily Chen: https://github.com/echen102/COVID-19-TweetIDs
Tweets have to be downloaded (i.e. hydrated) from the echen repository and indexed into an Elasticsearch index. See the steps below:
- Clone the echen repository (https://github.com/echen102/COVID-19-TweetIDs)
- Install twarc with pip and configure it with a Twitter account. See: https://github.com/DocNow/twarc
- Launch the echen hydrate script
- Copy all hydrated tweets. They are gzipped: find . -name '*.jsonl.gz' -exec cp -prv '{}' 'hydrating-and-extracting' ';'
- Unzip all .jsonl.gz files:
gunzip hydrating-and-extracting/coronavirus-tweet
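The copy-and-decompress steps above can also be done in one pass from Python. A minimal sketch (the function name and directory layout are illustrative, not part of the repository):

```python
import gzip
import shutil
from pathlib import Path

def decompress_hydrated(src_dir, dest_dir):
    """Find every hydrated *.jsonl.gz under src_dir and write the
    decompressed .jsonl files into dest_dir (created if needed)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in Path(src_dir).rglob("*.jsonl.gz"):
        target = dest / archive.name[:-3]  # strip the trailing ".gz"
        with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream, so large files fit in memory
        written.append(target)
    return written
```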
- Index into Elasticsearch:
    - Installation of ELK and plugins:
        - Install ELK: Logstash, Elasticsearch and Kibana
        - Install a Logstash plugin to geocode user locations (the plugin queries a REST API):
sudo /usr/share/logstash/bin/logstash-plugin install logstash-filter-rest
    - Start indexing into Elasticsearch with Logstash:
sudo /usr/share/logstash/bin/logstash -f elasticsearch/logstash-config/json.conf
- /!\ Be careful if you index from a laptop over Wi-Fi: the wlan interface may power off even if you disable sleep mode. On a Debian/Ubuntu OS, you'll need to disable power management on your wlan interface:
sudo iwconfig wlo1 power off
(not persistent across reboots)
- (OPTIONAL) Kibana: you can import the dashboard
The following script allows you to:
- Build a Hierarchical TF-IDF, called H-TFIDF, over space and time
- Build a classical TF-IDF to compare with
- Encode the terms extracted by both measures to compute semantic similarity:
COVID-19-TweetIDS-ES-Analyse.py
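The actual H-TFIDF implementation lives in COVID-19-TweetIDS-ES-Analyse.py and queries Elasticsearch; the sketch below only illustrates the underlying idea under a simplifying assumption: tweets are aggregated per (region, week) window, each window becomes one pseudo-document, and TF-IDF is computed over those pseudo-documents, so a term scores highly when it is frequent in one spatio-temporal window but rare in the others.

```python
import math
from collections import Counter, defaultdict

def h_tfidf(tweets, top_k=3):
    """Toy H-TFIDF: tweets is an iterable of (region, week, text) triples.
    Returns, for each (region, week) window, its top_k discriminative terms."""
    # 1. Aggregate tweets into one pseudo-document per spatio-temporal window.
    docs = defaultdict(list)
    for region, week, text in tweets:
        docs[(region, week)].extend(text.lower().split())
    # 2. Document frequency: in how many windows does each term appear?
    n_docs = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    # 3. TF-IDF per window, then keep the best-ranked terms.
    ranked = {}
    for window, terms in docs.items():
        tf = Counter(terms)
        total = len(terms)
        scores = {t: (tf[t] / total) * math.log(n_docs / df[t]) for t in tf}
        ranked[window] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return ranked
```

A term like "covid" that occurs in every window gets an IDF of zero, while a term specific to one region-week rises to the top of that window's list.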
More experiments and methods for evaluating H-TFIDF against a classical TF-IDF can be found in the scripts, with explanations.
In order to explore the dataset without using Elasticsearch (except for one of them), here are some scripts that give first results:
- Dataset Analysis: some statistics computed on the echen corpus
- Extractor: a script extracting only tweet contents (without RTs), in order to share data without the full verbose Twitter API payload
- And a pipeline for term extraction using biotex:
    - preprocess: cleans up tweets and builds the corpus in the biotex syntax
    - biotex-wrapper: automates biotex runs over 4 settings
    - merge biotex results: due to the size of this corpus, biotex could not be launched on the full corpus; it has to be split into 30k-tweet chunks, and the results then have to be merged and re-ranked
- Other functions to explore, but which use Elasticsearch
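The split-then-merge step of the biotex pipeline can be sketched as follows. This is a minimal illustration, not the repository's merge script: the 30k chunk size comes from the text above, but the merge strategy (summing per-chunk scores before re-ranking) is one plausible choice among several.

```python
from collections import defaultdict

CHUNK_SIZE = 30_000  # biotex cannot be run on the full corpus in one pass

def chunk(tweets, size=CHUNK_SIZE):
    """Split the tweet list into slices small enough for one biotex run."""
    for i in range(0, len(tweets), size):
        yield tweets[i:i + size]

def merge_rankings(per_chunk_scores):
    """Merge the term->score dicts produced on each chunk by summing the
    per-chunk scores, then re-rank globally (highest total first)."""
    totals = defaultdict(float)
    for scores in per_chunk_scores:
        for term, score in scores.items():
            totals[term] += score
    return sorted(totals, key=totals.get, reverse=True)
```

Each chunk would be preprocessed into the biotex input syntax, ranked by biotex, and the resulting per-chunk term scores fed to merge_rankings.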
This is based upon the works of:
- Juan Antonio LOSSIO-VENTURA, creator of BioTex
- Jacques Fize, who built a Python wrapper for BioTex (see his repository for more details)
- Gaurav Shrivastava, who coded the FASTR algorithm in Python. His script is in this repository
This code is provided under the CeCILL-B free software license agreement.
By using E. Chen's dataset, as stated by the author, you agree to abide by the stipulations in the license, remain in compliance with Twitter's Terms of Service, and cite the following manuscript:
Chen E, Lerman K, Ferrara E. Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health Surveill 2020;6(2):e19273. DOI: 10.2196/19273. PMID: 32427106