ML_project2

This project was carried out for the Machine Learning course at EPFL during the Autumn 2019 semester. It aims to classify tweets that used to contain a positive :) smiley or a negative :( smiley.

Members

Code

The code is located in the project_text_classification folder. There are 3 main Python files (.py):

  • run.py The main file; it contains our best model and produces the best predictions we obtained.
  • models.py The implementation of the other models, which were not as efficient as the one in run.py.
  • clean.py The functions we defined to help clean the data during this project.

The bert folder contains the source code to train the BERT model.

The code relies on the following libraries: pandas, numpy, nltk, keras, sklearn, gensim, h5py, torch, transformers, and tensorflow 1.13.0rc1. You can install them easily with pip.
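For reference, a single pip command along these lines should cover them (note that sklearn is published on PyPI as scikit-learn; adjust versions to your environment):

```bash
# Install the dependencies listed above; the TensorFlow pin matches the version mentioned in this README.
pip install pandas numpy nltk keras scikit-learn gensim h5py torch transformers tensorflow==1.13.0rc1
```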

Other models

We also provide 5 files used in one of our models (for the GloVe embedding). They are not part of the best predictions, but we upload them for completeness. Those files are:

  • build_vocab.sh
  • cut_vocab.sh
  • pickle_vocab.py
  • cooc.py
  • glove_solution.py

The files should be run in this order, as explained in models.py; a sketch of the sequence follows below.
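A minimal sketch of that sequence, assuming the scripts are run from project_text_classification and keep their intermediate files in the working directory (the comments reflect what the file names suggest each step does):

```bash
# GloVe embedding pipeline, in the order listed above (not needed for the best predictions).
bash build_vocab.sh       # build the raw vocabulary from the tweets
bash cut_vocab.sh         # keep only sufficiently frequent words
python pickle_vocab.py    # serialize the vocabulary to a pickle file
python cooc.py            # build the word co-occurrence matrix
python glove_solution.py  # train the GloVe word embeddings
```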

Data

The data can be found on Aicrowd. Download the data, create a folder named Datasets in project_text_classification, and place in it the twitter-datasets folder obtained by unzipping twitter-datasets.zip.
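A minimal sketch of that setup, assuming twitter-datasets.zip was downloaded into project_text_classification/ and unpacks to a top-level twitter-datasets folder as described above:

```bash
# Run from project_text_classification/.
mkdir -p Datasets
unzip twitter-datasets.zip -d Datasets/   # yields Datasets/twitter-datasets/
```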

You should have the following structure: project_text_classification/Datasets/twitter-datasets/, containing six files:

  • sample_submission.csv
  • test_data.txt The data used to make our predictions on Aicrowd.
  • train_neg_full.txt & train_neg.txt The tweets that used to contain a negative :( smiley. The second file is a subset of the first one.
  • train_pos_full.txt & train_pos.txt The tweets that used to contain a positive :) smiley. The second file is a subset of the first one.

We use train_neg_full.txt, train_pos_full.txt and test_data.txt in our code.

Run

Here are the instructions to reproduce our best predictions.

Prerequisite

You should download the pre-trained model available here. Unzip it and put the folder in project_text_classification/bert/ without changing the folder name (uncased_L-12_H-768_A-12).
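A minimal sketch, assuming the downloaded archive is named uncased_L-12_H-768_A-12.zip (the actual file name may differ) and unpacks to a folder with the same name:

```bash
# Run from project_text_classification/: unpack the pre-trained BERT checkpoint into bert/,
# keeping the folder name uncased_L-12_H-768_A-12 unchanged.
unzip uncased_L-12_H-768_A-12.zip -d bert/
```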

Execute

Your current directory should be project_text_classification/; from there, execute python run.py. This will create a file submission.csv containing our best predictions for each entry of the test set. It consists of a series of 1 and -1 labels, where 1 means we predict a positive smiley :) and -1 a negative smiley :(.
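For example, starting from the repository root:

```bash
cd project_text_classification
python run.py   # creates submission.csv with the predictions
```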

DISCLAIMER: this might take several hours, even on an Nvidia P100 GPU.

Report

You can find a detailed explanation of our work in report.pdf.
