clavance/classify

Full text classification pipeline with Keras and scikit-learn, spaCy for embeddings, and seaborn and matplotlib for visualisation. Neural network models include a CNN, an LSTM, and a hierarchical attention network.

Project files

This is a summary of how to run the models and visualisations in this project.

We provide the files for running all the models discussed in the project.

  • Download and save the data.csv file.
  • Due to file size limitations in CATe, we have hosted the file here.
  • It contains three columns:
    1. "Text": the raw text samples, extracted from the HTML source
    2. "Class": the class label of each text sample
    3. "Cleaned": the preprocessed text samples, for easy access. We discuss the preprocessing steps thoroughly in the project, and also provide the preprocessing functions in our files.
  • Files provided:
    1. dataset.py: get word counts, visualise dataset
    2. machine_learning_models.py: run machine learning models
    3. deep_learning_models.py: run deep learning models (excl. HAN)
    4. han.py: run hierarchical attention network (HAN) model
    5. decision_boundary_visualisation.py: visualisation for k-NN classifier
    6. extract.py: extract text samples from raw HTML files
    7. decision_tree.pdf: image of decision tree visualisation
    8. random_forest.pdf: image of random forest visualisation

Libraries

  • scikit-learn (machine learning library)
  • tensorflow (machine learning library)
  • keras (ML library, used with the TensorFlow backend)
  • pandas (for data handling)
  • spaCy (for various NLP tasks)
  • mlxtend (for some plotting functions, additional utilities, and statistical testing)
  • adjustText (to adjust overlapping matplotlib labels)
  • seaborn (for visualisation)
  • matplotlib (for visualisation)
  • beautifulsoup4 (for parsing the raw HTML files)

Installation

Note that we do not use the latest version of matplotlib (v3.1.1) because of a bug in plotting confusion matrices; we pin matplotlib 3.1.0 instead.

With Python 3.x, using pip:

pip install scikit-learn tensorflow keras pandas spacy mlxtend adjustText seaborn matplotlib==3.1.0 beautifulsoup4
python -m spacy download en_core_web_lg
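
A quick, optional sanity check of the environment (this only verifies that the libraries import and that the spaCy model was downloaded; adjust if you use a virtual environment):

import sklearn, tensorflow, keras, pandas, mlxtend, seaborn, matplotlib, bs4
import spacy

nlp = spacy.load('en_core_web_lg')  # raises OSError if the model was not downloaded
print('Environment OK; spaCy vector table shape:', nlp.vocab.vectors.shape)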

Usage of dataset.py

Import all dependencies and function definitions in the file.

We acknowledge that the cleanup_text functions and plot_tokens_clean function are based on this kernel.

  • Store data in a pandas dataframe, 'df'
  • Pass in as a string the file location where data.csv has been saved.
df = load_data('/.../data.csv') 
  • Load the largest English model from spaCy
nlp = spacy.load('en_core_web_lg') 
  • class_distributions takes the loaded pandas dataframe as its argument
  • prints the number of samples in each class
  • plots a bar chart of the distribution of samples across classes
  • prints counts of tokens/sentences per sample
  • plots bar charts of the most common words in each class
class_distributions(df)
  • plot a visualisation of spaCy's embeddings of some common words
visualise_embeddings()
  • visualise our dataset with a count or TF-IDF vectorizer, with dimensionality reduction to 2D
  • takes in the dataframe and two strings as parameters
  • the first string is the vectorizer ("count" or "tfidf")
  • the second string chooses whether to visualise the full dataset or only the test set ("full" or "test")
visualise_data(df, "count", "full")
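  • for reference, the core of such a 2D visualisation can be sketched as below (a simplified sketch assuming TruncatedSVD for the dimensionality reduction; the actual visualise_data function may use a different reduction method and styling)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# vectorise the preprocessed text, then project the sparse feature matrix down to 2D
X = TfidfVectorizer().fit_transform(df['Cleaned'])
X_2d = TruncatedSVD(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=df['Class'], cmap='tab20', s=5)
plt.title('TF-IDF features reduced to 2D')
plt.show()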

Usage of machine_learning_models.py

Import all dependencies and function definitions in the file.

We acknowledge that the plot_confusion_matrix function is based on this article.

  • Store data in a pandas dataframe, 'df'
  • Pass in as a string the file location where data.csv has been saved.
df = load_data('/.../data.csv') 
  • the run_model function takes three parameters and returns a fitted vectorizer, a trained classifier, and a confusion matrix; it also prints sklearn's classification report, accuracy and macro-F1
  • parameters:
    1. df - the dataframe we loaded above
    2. vectorizer - the name of the vectorizer (options: "count" or "tfidf")
    3. classifier - the name of the classifier
    • classifier options:
      • "naive_bayes"
      • "decision_tree"
      • "random_forest",
      • "logistic_regression"
      • "linear_svm",
      • "nonlinear_svm"
      • "knn"
      • "mlp"
vectorizer, classifier, cm = run_model(df, "count", "logistic_regression")
  • next, run another classifier if desired
vectorizer2, classifier2, cm2 = run_model(df, "tfidf", "linear_svm")
  • plot the confusion matrix
  • takes the confusion matrix object returned by run_model as its argument
  • if normalize=True, plots a normalised confusion matrix
  • change the title string as desired
plot_confusion_matrix(cm, normalize=False, target_names=[i for i in range(1,21)],
 title='Confusion Matrix')
  • run a McNemar's test if desired (pass in the two classifiers returned by run_model)
stat_test(df, classifier, classifier2)
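  • for context, a McNemar's test on two sets of test predictions with mlxtend typically follows the pattern below (an illustrative toy example only; stat_test performs the split, vectorisation and prediction internally, so its implementation may differ)
import numpy as np
from mlxtend.evaluate import mcnemar_table, mcnemar

# toy example: true labels and two models' predictions on the same test set
y_true  = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
y_pred1 = np.array([1, 1, 2, 2, 2, 2, 3, 3, 1, 3])
y_pred2 = np.array([1, 2, 2, 2, 1, 2, 3, 3, 3, 3])

tb = mcnemar_table(y_target=y_true, y_model1=y_pred1, y_model2=y_pred2)  # 2x2 contingency table
chi2, p = mcnemar(ary=tb, corrected=True)
print('chi-squared: %.3f, p-value: %.3f' % (chi2, p))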
  • generate a plot of the logistic regression coefficients
  • pass in the classifier and vectorizer objects returned by run_model
  • only works with a logistic regression classifier
plot_lr_coef(classifier, vectorizer)
  • generate k-NN performance graph used in our report
  • values in this plot are hard-coded from our results
knn_plot()

Usage of deep_learning_models.py

Import all dependencies and function definitions in the file.

  • Store data in a pandas dataframe, 'df'
  • Pass in as a string the file location where data.csv has been saved.
df = load_data('/.../data.csv') 
  • Load the largest English model from spaCy
  • by default, spaCy limits inputs to 1,000,000 characters
  • the longest sample in our dataset has 2,871,868 characters, so we raise this limit
nlp = spacy.load('en_core_web_lg')
nlp.max_length = 2871868
  • Set the following options
  • if using word embeddings, MAX_LENGTH controls max number of words per sample
  • if using sentence embeddings, MAX_LENGTH controls max number of sentences per sample
MAX_LENGTH = 300 # the max length per sample (choose wisely)
NB_EPOCHS = 50 # number of epochs over which to run. choose some integer, ideally <100
EARLY_STOPPING = True
PATIENCE = 5 # patience for early stopping, if set to True. choose some integer < NB_EPOCHS
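  • for reference, EARLY_STOPPING and PATIENCE correspond to the standard Keras early-stopping mechanism, which is wired up roughly as follows (a sketch of the general pattern; run_model in deep_learning_models.py may configure it differently)
from keras.callbacks import EarlyStopping

callbacks = []
if EARLY_STOPPING:
    # stop training once validation loss has not improved for PATIENCE consecutive epochs
    callbacks.append(EarlyStopping(monitor='val_loss', patience=PATIENCE))
# the callback list is then passed to model.fit(..., epochs=NB_EPOCHS, callbacks=callbacks)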
  • we run the function that adds spaCy embeddings to our dataframe
  • the function takes in 3 parameters and returns the updated df
    1. df - the pandas dataframe containing our data
    2. embedding - the type of embedding (options: "word_embeddings", "sentence_embeddings")
    3. MAX_LENGTH - an integer between 1 and 300,000, but due to memory requirements ideally <=1,000. Note: setting MAX_LENGTH=1000 already loads ~75GB of embeddings into memory
df = get_embeddings(df, "word_embeddings", MAX_LENGTH)
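  • roughly speaking, the word-embedding option maps each sample to a (MAX_LENGTH, 300) matrix of spaCy vectors, which is why memory grows quickly with MAX_LENGTH; a simplified sketch of the idea (the actual get_embeddings function may pad, truncate or store the vectors differently)
import numpy as np

def embed_sample(text, nlp, max_length):
    # one 300-dimensional en_core_web_lg vector per token, truncated and zero-padded to max_length
    doc = nlp(text)
    vectors = [token.vector for token in doc[:max_length]]
    while len(vectors) < max_length:
        vectors.append(np.zeros(300, dtype='float32'))
    return np.stack(vectors)

# e.g. embed_sample(df['Cleaned'][0], nlp, MAX_LENGTH) has shape (MAX_LENGTH, 300)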
  • we run the neural network model; the run_model function takes in 6 parameters, including those defined above:
    1. df - the pandas dataframe containing our data
    2. architecture - the type of architecture (options: "mlp", "cnn", "ngram_cnn", "lstm", "bi_lstm")
    • note: "ngram_cnn" is based on the CNN implemented in (Kim, 2014), as discussed in our report
    3. MAX_LENGTH
    4. NB_EPOCHS
    5. EARLY_STOPPING
    6. PATIENCE
model, cm = run_model(df, "mlp", MAX_LENGTH, NB_EPOCHS, EARLY_STOPPING, PATIENCE)
  • plot confusion matrix (normalised or not), if desired
plot_confusion_matrix(cm, normalize=False, target_names=[i for i in range(1,21)])

Usage of han.py

We acknowledge that the implementation of the dot_product function and AttentionWithContext layer, and general design of the network, is based on this and that repository.

However, we adapted those existing implementations and configured the network for our own context and purposes.

Import all dependencies and function definitions in the file.

  • Store data in a pandas dataframe, 'df'
  • Pass in as a string the file location where data.csv has been saved.
df = load_data('/.../data.csv') 
  • Set the following parameters (adjust as desired)
MAX_WORDS = 261737  # number of unique tokens in our training set
MAX_SENTS = 300 # we stick to max. 300 sentences per sample
MAX_SENT_LENGTH = 300 # we stick to max. 300 tokens per sentence
VALIDATION_SPLIT = 0.2 # same training:validation split, 80:20
EMBEDDING_DIM = 300  # we stick to embedding dimensions of 300
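  • these parameters define the 3D input a HAN expects: each sample becomes a (MAX_SENTS, MAX_SENT_LENGTH) matrix of word indices, which the network embeds into EMBEDDING_DIM dimensions; a rough sketch of that shaping step (han.py's own preprocessing may differ in the details)
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(df['Cleaned'])

def shape_sample(sentences):
    # sentences: a list of sentence strings for one document
    seqs = pad_sequences(tokenizer.texts_to_sequences(sentences), maxlen=MAX_SENT_LENGTH)
    sample = np.zeros((MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
    sample[:min(len(seqs), MAX_SENTS)] = seqs[:MAX_SENTS]
    return sample  # shape (MAX_SENTS, MAX_SENT_LENGTH)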
  • run the model
model, cm = run_model(df, MAX_WORDS, MAX_SENTS, MAX_SENT_LENGTH, VALIDATION_SPLIT, EMBEDDING_DIM)
  • plot the confusion matrix
plot_confusion_matrix(cm, normalize=False, target_names=[i for i in range(1,21)], title='Confusion Matrix')

Usage of decision_boundary_visualisation.py

Run the entire Python script.

We used a list of 20 distinct colours from here.

  • It generates a plot of our k-NN classifier's decision boundaries.
  • The dimensions of our features have been reduced to 2D, so this is not representative of the actual classifiers we trained.
  • In the file there is a variable, vectorizer, which is set to TfidfVectorizer().
  • Change this to CountVectorizer() if desired.
# vectorizer = CountVectorizer()  
vectorizer = TfidfVectorizer() # choose vectorizer
  • There are also two variables, k=1 and weights='distance'.
  • k can be set to any desired value, and the corresponding plot will be generated.
  • weights can be set to 'distance' or 'uniform'.
k = 1 # choose value for k-NN  
# weights = 'uniform'  
weights = 'distance' # choose distance weighting  
classifier = KNeighborsClassifier(n_neighbors=k, weights=weights)
  • Run the entire Python script once these are set.
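  • for reference, the general recipe behind this kind of decision-boundary plot is sketched below (a simplified version assuming TruncatedSVD for the 2D reduction and a mesh grid for the regions; the script's own plotting code is more elaborate)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('/.../data.csv')  # path to the saved data.csv

# reduce TF-IDF features to 2D, then fit k-NN in that 2D space
X_2d = TruncatedSVD(n_components=2).fit_transform(TfidfVectorizer().fit_transform(df['Cleaned']))
y = df['Class']
clf = KNeighborsClassifier(n_neighbors=1, weights='distance').fit(X_2d, y)

# predict over a mesh grid covering the 2D space and draw the resulting regions
xx, yy = np.meshgrid(np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 200),
                     np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3, cmap='tab20')
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab20', s=5)
plt.show()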

Usage of extract.py

This file extracts the text samples from the raw HTML files of the original EURLEX dataset, which can be accessed here.

To run this file, first download and unzip all raw HTML files from here. Note the directory in which these files have been saved.

Create a new empty directory where the extracted text files will be stored.

  • Import dependencies
import os
from bs4 import BeautifulSoup
  • Initialise variables containing the filepath of the directory where all the raw HTML files are, and the directory where the extracted text files will be stored.
base_dir = "/..." # the location of raw HTML files
second_dir = "/..." # the location of extracted files

Run the rest of the script.
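
In outline, the rest of the script walks through base_dir, parses each HTML file with BeautifulSoup, and writes the extracted text into second_dir. A minimal sketch of that loop (extract.py itself targets the specific structure of the EUR-Lex pages, so its parsing logic is more involved):

for filename in os.listdir(base_dir):
    if not filename.endswith('.html'):  # assumes the raw files use a .html extension
        continue
    with open(os.path.join(base_dir, filename), encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    text = soup.get_text(separator=' ', strip=True)  # plain body text of the document
    with open(os.path.join(second_dir, filename.replace('.html', '.txt')), 'w', encoding='utf-8') as out:
        out.write(text)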

The output of this process is the body text of each file, with a CELEX ID. We then matched the CELEX IDs with the document IDs in this file.

At this stage, we have the body texts, with document IDs. We then matched the document IDs with the top level directory codes, which are the class labels in the file id2class_eurlex_DC_l1.qrels, from this zip file.

Note that the output of this process has been saved in the file data.csv, in a format which is easy to work with, especially as a pandas dataframe.

Acknowledgements

EURLEX dataset modified with permission.

Access the original dataset here.

See the original paper here.

Access the full EUR-Lex repository here.

EUR-Lex data is used and modified with permission and remains the property of:

'© European Union, https://eur-lex.europa.eu, 1998-2019'
