NLP - Cross lingual offensive language identification

authors: Gojko Hajdukovic, Simon Dimc, 05.2021

Table of contents:

Setup
Usage

Setup

These instructions assume that you are in the repository root.

cd <repo_root>

To set up a virtual environment, run:

python -m venv venv
# Activate the environment
source venv/bin/activate

To install all project dependencies, run:

pip3 install -r requirements.txt
python -m spacy download en_core_web_sm

Get datasets: Datasets are stored in the data/source_data folder. Download each dataset from the source listed below and place it in the corresponding folder:
- data/source_data/eng/binary/dataset_1
  source: https://github.com/sjtuprog/fox-news-comments
  Parse the JSON file fox-news-comments.json and convert it into a data.csv file with the format Label:Text.
- data/source_data/eng/binary/dataset_2
  source: https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech
  Use the Reddit data file.
  Rename the file to data.csv.
- data/source_data/eng/binary/dataset_3
  source: https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech
  Use the Gab data file.
  Rename the file to data.csv.
- data/source_data/eng/binary/dataset_4
  source: https://github.com/Vicomtech/hate-speech-dataset
  Copy the following folders and files:
  folders: all_files, sampled_test
  files: annotations_metadata.csv
- data/source_data/eng/multiclass/dataset_5
  source: https://github.com/mayelsherif/hate_speech_icwsm18
  Either download the tweets using the provided Tweet IDs or contact the authors. Store the tweets as CSV files in the format tweet_id,tweet inside a downloaded_tweets_dataset folder. Each CSV file should have the same name as the corresponding provided Tweet ID file.
- data/source_data/eng/multiclass/dataset_6
  source: https://github.com/Mrezvan94/Harassment-Corpus
  Contact the authors to obtain the data. Put the CSV files inside a tweets_dataset folder.
- data/source_data/slo/multiclass/dataset_2
  source: https://www.clarin.si/repository/xmlui/handle/11356/1398
  Either download the tweets using the provided Tweet IDs or contact the authors. Then parse the data into a data.csv file with the format Text,Class,Type.
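
The conversion step for dataset_1 could be sketched roughly as follows. This is a minimal sketch, not the authors' script. It assumes that fox-news-comments.json holds one JSON object per line with "label" and "text" keys, that data.csv should start with a Label/Text header row, and that the separator in "Label:Text" is a literal colon. None of these points are confirmed here, so check the loading code in preprocess before relying on the output. The function name convert_fox_news is hypothetical.

```python
import csv
import json
from pathlib import Path


def convert_fox_news(json_path, csv_path, delimiter=":"):
    """Convert a JSON Lines file into a two-column Label/Text CSV.

    Assumption: each non-empty line is a JSON object with "label" and
    "text" keys. Returns the number of data rows written.
    """
    rows = 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # The csv module quotes any text that contains the delimiter.
        writer = csv.writer(dst, delimiter=delimiter)
        writer.writerow(["Label", "Text"])  # header row is an assumption
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            writer.writerow([record["label"], record["text"]])
            rows += 1
    return rows


if __name__ == "__main__":
    target = Path("data/source_data/eng/binary/dataset_1")
    source = target / "fox-news-comments.json"
    if source.exists():
        count = convert_fox_news(source, target / "data.csv")
        print(f"Wrote {count} rows to {target / 'data.csv'}")
    else:
        print(f"{source} not found; download it first")
```

If the project turns out to expect a comma-separated file, pass delimiter="," instead.
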
Get models:
- CroSloEngual BERT pre-trained model
  source: https://www.clarin.si/repository/xmlui/handle/11356/1330
  Put config.json, pytorch_model.bin, and vocab.txt inside classifiers/bert/CroSloEngual.
- Fine-tuned CroSloEngual BERT models
  source: https://drive.google.com/drive/folders/1j2BJ-X0WdNpxDFJHrmm-DsYuy1GeFb03?usp=sharing
  Put bert/binary.pt and bert/multiclass.pt inside models/bert.
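
Because the setup involves many manual downloads, a small check can confirm that everything is in place before running the CLI. The following is a minimal sketch, not part of the repository. The path list is taken from the steps above, with the assumption that the downloaded_tweets_dataset and tweets_dataset folders sit directly inside dataset_5 and dataset_6. The names EXPECTED_PATHS and missing_paths are hypothetical.

```python
from pathlib import Path

# Files and folders named in the setup steps (locations partly assumed).
EXPECTED_PATHS = (
    "data/source_data/eng/binary/dataset_1/data.csv",
    "data/source_data/eng/binary/dataset_2/data.csv",
    "data/source_data/eng/binary/dataset_3/data.csv",
    "data/source_data/eng/binary/dataset_4/all_files",
    "data/source_data/eng/binary/dataset_4/sampled_test",
    "data/source_data/eng/binary/dataset_4/annotations_metadata.csv",
    "data/source_data/eng/multiclass/dataset_5/downloaded_tweets_dataset",
    "data/source_data/eng/multiclass/dataset_6/tweets_dataset",
    "data/source_data/slo/multiclass/dataset_2/data.csv",
    "classifiers/bert/CroSloEngual/config.json",
    "classifiers/bert/CroSloEngual/pytorch_model.bin",
    "classifiers/bert/CroSloEngual/vocab.txt",
    "models/bert/binary.pt",
    "models/bert/multiclass.pt",
)


def missing_paths(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [p for p in expected if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing files or folders:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected files and folders are present.")
```

Run the script from the repository root so that the relative paths resolve correctly.
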

Usage

The project implements multiple classifiers for two classification tasks: binary and multiclass. A CLI application is provided to reproduce the results from the report. The following instructions assume that you are in the project root.

To display the CLI help text, run:

python main.py --help

Examples:

python main.py --prepareData true --type multi --model LR
python main.py -pd false -t bi -m BERT
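
To run several configurations in a row, the example invocations above can be wrapped in a small driver. The following is a minimal sketch, not part of the repository. It uses only the flags and values shown in the examples; other accepted values, if any, are listed by python main.py --help. The helper names build_command and run_all are hypothetical, and the dry-run default means nothing is executed until it is switched off.

```python
import subprocess
import sys


def build_command(prepare_data, task_type, model):
    """Build the argument list for one main.py invocation."""
    return [
        sys.executable, "main.py",
        "--prepareData", "true" if prepare_data else "false",
        "--type", task_type,
        "--model", model,
    ]


def run_all(configs, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    Returns the number of configurations processed.
    """
    count = 0
    for prepare_data, task_type, model in configs:
        cmd = build_command(prepare_data, task_type, model)
        print("Command:", " ".join(cmd))
        if not dry_run:
            # check=True stops the loop as soon as one run fails.
            subprocess.run(cmd, check=True)
        count += 1
    return count


if __name__ == "__main__":
    # The two configurations from the examples above.
    run_all([(True, "multi", "LR"), (False, "bi", "BERT")])
```

Pass dry_run=False to actually launch the runs from the project root.
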

For BERT fine-tuning, you can use the Google Colab notebook notebooks/bert-notebook.ipynb.
