This project was realized during the Machine Learning course in the Autumn 2019 semester at EPFL. It aims at classifying tweets that used to contain a positive :) or a negative :( smiley.
The code is located in the project_text_classification folder. There are three main Python files (.py):

run.py
The main file; it contains our best model and produces the best predictions we could get.

models.py
The implementation of the other models that were not as efficient as the one in run.py.

clean.py
Functions we defined to help us clean the data during this project.
The folder bert contains the source code to train the BERT model.
The code relies on the following libraries: pandas, numpy, nltk, keras, sklearn, gensim, h5py, torch, transformers, and tensorflow 1.13.0rc1. You can install them easily with pip.
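Before running anything, you can check which of these libraries are already importable. The helper below is our own illustration (not part of the repository); note that sklearn is published on pip under the name scikit-learn:

```python
import importlib.util

# Packages required by the project, by their import names.
REQUIRED = ["pandas", "numpy", "nltk", "keras", "sklearn",
            "gensim", "h5py", "torch", "transformers", "tensorflow"]

def missing_packages(packages=REQUIRED):
    """Return the packages that are not importable in the current environment."""
    return [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install them with pip:", " ".join(missing))
    else:
        print("All dependencies are installed.")
```

Whatever is reported missing can then be installed with pip (remember to pin tensorflow to 1.13.0rc1).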
We also have 5 files used in one of our models (for GloVe embedding); they are not part of the best predictions, but we upload them for completeness. Those files are:
build_vocab.sh
cut_vocab.sh
pickle_vocab.py
cooc.py
glove_solution.py
The files should be run in this order, as explained in models.py.
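As a sketch, the five steps can also be chained from Python instead of being launched by hand. The helper below is our own addition (not part of the repository) and assumes it is called from the project_text_classification/ folder with the five GloVe files present:

```python
import subprocess

# The five GloVe steps, in the required order:
# vocab -> pruned vocab -> pickled vocab -> co-occurrence matrix -> embeddings.
GLOVE_STEPS = [
    ["bash", "build_vocab.sh"],
    ["bash", "cut_vocab.sh"],
    ["python", "pickle_vocab.py"],
    ["python", "cooc.py"],
    ["python", "glove_solution.py"],
]

def run_glove_pipeline():
    """Run each step in order, aborting if one of them fails."""
    for cmd in GLOVE_STEPS:
        subprocess.run(cmd, check=True)
```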
The data can be found on AIcrowd. Download it, create a folder named Datasets in project_text_classification, and place in it the folder twitter-datasets obtained by unzipping twitter-datasets.zip.
You should have the following structure: project_text_classification/Datasets/twitter-datasets/, containing six files:

sample_submission.csv

test_data.txt
The data used to make our predictions on AIcrowd.

train_neg_full.txt & train_neg.txt
The tweets that used to contain a negative :( smiley. The second file is a subset of the first one.

train_pos_full.txt & train_pos.txt
The tweets that used to contain a positive :) smiley. The second file is a subset of the first one.

We use train_neg_full.txt, train_pos_full.txt and test_data.txt in our code.
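To catch path mistakes early, the layout can be verified before running the models. This small check is our own illustration, not part of the repository:

```python
from pathlib import Path

DATA_DIR = Path("Datasets/twitter-datasets")

# The six files expected after unzipping twitter-datasets.zip.
EXPECTED_FILES = [
    "sample_submission.csv",
    "test_data.txt",
    "train_neg_full.txt",
    "train_neg.txt",
    "train_pos_full.txt",
    "train_pos.txt",
]

def missing_data_files(data_dir=DATA_DIR):
    """Return the expected data files that are absent from data_dir."""
    return [name for name in EXPECTED_FILES if not (Path(data_dir) / name).is_file()]
```

Run it from project_text_classification/; an empty result means the dataset is in place.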
Here are the instructions to reproduce our best predictions.
You should download the pre-trained model available here. Unzip it and put the folder in project_text_classification/bert/ without changing the folder name (uncased_L-12_H-768_A-12).
Your current directory should be project_text_classification/; from there, execute python run.py. This will create a file submission.csv containing our best predictions for each entry of the test set. It is composed of a series of 1 and -1, where 1 means we predict a positive smiley :) and -1 a negative smiley :(.
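The submission format described above can be sketched as follows. This writer is our own illustration; the Id/Prediction header is an assumption based on the usual AIcrowd sample submission, so check sample_submission.csv to be sure:

```python
import csv

def write_submission(predictions, path="submission.csv"):
    """Write predictions (values in {1, -1}) as a CSV submission file.

    The header names (Id, Prediction) are assumed from the usual AIcrowd
    sample submission format.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for idx, pred in enumerate(predictions, start=1):
            writer.writerow([idx, pred])

# Example: write_submission([1, -1, 1]) produces three rows after the header.
```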
DISCLAIMER: this might take several hours, even with an Nvidia P100.
You can find a detailed explanation of our work in report.pdf.