remydecoupes/covid19-tweets-mood-tetis

Hierarchical_TFIDF applied on covid-19 tweets

This project aims to extract discriminative terms over spatial and temporal windows. Experiments are conducted on the COVID-19 tweet corpus created by Emily Chen: https://github.com/echen102/COVID-19-TweetIDs

Prerequisites:

Tweets have to be downloaded (i.e. hydrated) from E. Chen's repository and indexed into an Elasticsearch index. See the steps below:

  1. Clone E. Chen's repository (https://github.com/echen102/COVID-19-TweetIDs)
  2. Install twarc with pip and configure it with a Twitter account. See: https://github.com/DocNow/twarc
  3. Run the hydrate script from E. Chen's repository
  4. Copy all hydrated tweets. They are gzipped: find . -name '*.jsonl.gz' -exec cp -prv '{}' 'hydrating-and-extracting' ';'
  5. Unzip all the .jsonl.gz files: gunzip hydrating-and-extracting/coronavirus-tweet
  6. Index the tweets into Elasticsearch:
    1. Install ELK and plugins:
      • Install ELK: Logstash, Elasticsearch and Kibana
      • Install the Logstash plugin used to geocode user locations (the plugin calls a REST API): sudo /usr/share/logstash/bin/logstash-plugin install logstash-filter-rest
    2. Start indexing into Elasticsearch with Logstash:
      • sudo /usr/share/logstash/bin/logstash -f elasticsearch/logstash-config/json.conf
      • /!\ Be careful if you index from a laptop over Wi-Fi: the system may power off the WLAN interface even if you disable sleep mode. On a Debian/Ubuntu OS, you need to disable power management on the WLAN interface => sudo iwconfig wlo1 power off (not persistent across reboots)
    3. (OPTIONAL) Kibana: you can import the dashboard
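
As a quick sanity check before indexing, the copied archives can be inspected directly from Python. This is a minimal sketch, not part of the repository: the function name iter_tweets is hypothetical, and it assumes each .jsonl.gz file holds one JSON tweet object per line.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_tweets(directory: str) -> Iterator[dict]:
    """Yield one tweet (as a dict) per non-empty line of every *.jsonl.gz file.

    Assumption: each archive contains one JSON object per line.
    """
    for path in sorted(Path(directory).rglob("*.jsonl.gz")):
        # "rt" decompresses on the fly and returns text lines.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)
```

For example, sum(1 for _ in iter_tweets("hydrating-and-extracting")) counts the hydrated tweets without unzipping anything to disk.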

Run the main script:

The main script, COVID-19-TweetIDS-ES-Analyse.py, does the following:

  • Builds a Hierarchical TF-IDF, called H-TFIDF, over space and time
  • Builds a classical TF-IDF for comparison
  • Encodes the terms extracted by both measures to compute their semantic similarity
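
To illustrate the general idea of scoring terms per spatial and temporal window, here is a minimal sketch. It is not the repository's H-TFIDF implementation: it simply assumes that all tweets sharing a (place, period) pair are merged into one document and that a plain TF-IDF is computed over those documents. The function name tfidf_by_window, the input tuples and the whitespace tokenisation are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def tfidf_by_window(tweets):
    """Score terms per (place, period) window with a plain TF-IDF.

    tweets: iterable of (place, period, text) tuples (hypothetical input shape).
    Returns {(place, period): {term: score}}.
    """
    # Merge all tweets of a window into one bag of words.
    term_counts = defaultdict(Counter)
    for place, period, text in tweets:
        term_counts[(place, period)].update(text.lower().split())

    # Document frequency: in how many windows does each term occur?
    n_windows = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    scores = {}
    for window, counts in term_counts.items():
        total = sum(counts.values())
        scores[window] = {
            term: (count / total) * math.log(n_windows / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

With this weighting, a term that occurs in every window gets a score of zero, so only terms specific to some places or periods stand out.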

Further experiments and methods for evaluating H-TFIDF against a classical TF-IDF can be found in the repository (script and explanation).

OPTIONAL scripts:

To explore the dataset without using Elasticsearch (except for one of them), here are some scripts that give first results:

This is based upon the work of:

  • Juan Antonio LOSSIO-VENTURA, creator of BioTex
  • Jacques Fize, who built a Python wrapper for BioTex (see his repository for more details)
  • Gaurav Shrivastava, who implemented the FASTR algorithm in Python. His script is in this repository

NB: Due to the size of the corpus, BioTex could not be run on the full corpus. The corpus has to be split into batches of 30k tweets; the results then have to be merged and ranked.
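
The split, merge and rank steps can be sketched as follows. This is only an illustration under stated assumptions: the helper names batches and merge_and_rank are hypothetical, each batch is assumed to yield a {term: score} mapping, and summing scores across batches is one possible merge rule, not necessarily the one used in this repository.

```python
from collections import defaultdict
from itertools import islice


def batches(items, size=30_000):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def merge_and_rank(per_batch_scores):
    """Merge {term: score} dicts and rank terms by descending merged score.

    Assumption: scores of the same term are summed across batches.
    """
    merged = defaultdict(float)
    for scores in per_batch_scores:
        for term, score in scores.items():
            merged[term] += score
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)
```

Each batch would be passed to the term extractor separately, and the per-batch results handed to merge_and_rank.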

License

This code is provided under the CeCILL-B free software license agreement.

Data Usage Agreement

By using E. Chen's dataset, you agree, as stated by its author, to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E. Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health Surveillance 2020;6(2):e19273. DOI: 10.2196/19273. PMID: 32427106
