clavance/classify

Full text classification pipeline with Keras and scikit-learn, spaCy for embeddings, and seaborn and matplotlib for visualisation. Neural network models include a CNN, an LSTM, and a hierarchical attention network.

Project files

This is a summary of how to run the models and visualisations in this project.

We provide the files for running all the models discussed in the project.

  • Download and save the data.csv file.
  • Due to file size limitations in CATe, we have hosted the file here.
  • It contains three columns:
    1. "Text": the raw text samples, extracted from the HTML source
    2. "Class": the class label of each text sample
    3. "Cleaned": the preprocessed text samples, for easy access. We discuss the preprocessing steps thoroughly in the project, and also provide the preprocessing functions in our files.
  • Files provided:
    1. dataset.py: get word counts, visualise dataset
    2. machine_learning_models.py: run machine learning models
    3. deep_learning_models.py: run deep learning models (excl. HAN)
    4. han.py: run hierarchical attention network (HAN) model
    5. decision_boundary_visualisation.py: visualisation for k-NN classifier
    6. extract.py: extract text samples from raw HTML files
    7. decision_tree.pdf: image of decision tree visualisation
    8. random_forest.pdf: image of random forest visualisation

Libraries

  • scikit-learn (machine learning library)
  • tensorflow (machine learning library)
  • keras (ML library, used with the TensorFlow backend)
  • pandas (for data handling)
  • spaCy (for various NLP tasks)
  • mlxtend (for some plotting functions, additional utilities, and statistical testing)
  • adjustText (to adjust overlapping matplotlib labels)
  • seaborn (for visualisation)
  • matplotlib (for visualisation)
  • beautifulsoup4 (for parsing the raw HTML files)

Installation

Note that we do not use the latest version of matplotlib (v3.1.1) because of a bug in plotting confusion matrices; we pin matplotlib 3.1.0 instead.

With Python 3.x, using pip:

pip install scikit-learn tensorflow keras pandas spacy mlxtend adjustText seaborn matplotlib==3.1.0 beautifulsoup4
python -m spacy download en_core_web_lg
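
A quick, optional sanity check of the environment (this only verifies that the libraries import and that the spaCy model was downloaded; adjust if you use a virtual environment):

import sklearn, tensorflow, keras, pandas, mlxtend, seaborn, matplotlib, bs4
import spacy

nlp = spacy.load('en_core_web_lg')  # raises OSError if the model was not downloaded
print('Environment OK; spaCy vector table shape:', nlp.vocab.vectors.shape)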

Usage of dataset.py

Import all dependencies and function definitions in the file.

We acknowledge that the cleanup_text functions and plot_tokens_clean function are based on this kernel.

  • Store data in a pandas dataframe, 'df'
  • Pass in as a string the file location where data.csv has been saved.
df = load_data('/.../data.csv') 
  • Load the largest English model from spaCy
nlp = spacy.load('en_core_web_lg') 
  • class_distributions takes the loaded pandas dataframe as its argument
  • prints the number of samples in each class
  • plots a bar chart of the distribution of samples across classes
  • prints counts of tokens/sentences per sample
  • plots bar charts of the most common words in each class
class_distributions(df)
  • plot a visualisation of spaCy's embeddings of some common words
visualise_embeddings()
  • visualise our dataset with a count or TF-IDF vectorizer, with dimensionality reduction to 2D
  • takes in the dataframe and two strings as parameters
  • the first string is the vectorizer ("count" or "tfidf")
  • the second string chooses whether to visualise the full dataset or only the test set ("full" or "test")
visualise_data(df, "count", "full")
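  • for reference, the core of such a 2D visualisation can be sketched as below (a simplified sketch assuming TruncatedSVD for the dimensionality reduction; the actual visualise_data function may use a different reduction method and styling)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# vectorise the preprocessed text, then project the sparse feature matrix down to 2D
X = TfidfVectorizer().fit_transform(df['Cleaned'])
X_2d = TruncatedSVD(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=df['Class'], cmap='tab20', s=5)
plt.title('TF-IDF features reduced to 2D')
plt.show()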

Usage of machine_learning_models.py

Import all dependencies and function definitions in the file.

We acknowledge that the plot_confusion_matrix function is based on this article.

  • Store data in a pandas dataframe, 'df'
  • Pass in as a string the file location where data.csv has been saved.
df = load_data('/.../data.csv') 
  • the run_model function takes three parameters and returns a fitted vectorizer, a trained classifier, and a confusion matrix; it also prints sklearn's classification report, accuracy and macro-F1
  • parameters:
    1. df - the dataframe we loaded above
    2. vectorizer - the name of the vectorizer (options: "count" or "tfidf")
    3. classifier - the name of the classifier
    • classifier options:
      • "naive_bayes"
      • "decision_tree"
      • "random_forest",
      • "logistic_regression"
      • "linear_svm",
      • "nonlinear_svm"
      • "knn"
      • "mlp"
vectorizer, classifier, cm = run_model(df, "count", "logistic_regression")
  • next, run another classifier if desired
vectorizer2, classifier2, cm2 = run_model(df, "tfidf", "linear_svm")
  • plot the confusion matrix
  • takes the confusion matrix object returned by run_model as its argument
  • if normalize=True, plots a normalised confusion matrix
  • change the title string as desired
plot_confusion_matrix(cm, normalize=False, target_names=[i for i in range(1,21)],
 title='Confusion Matrix')
  • run a McNemar's test if desired (pass in the two classifiers returned by run_model)
stat_test(df, classifier, classifier2)
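  • for context, a McNemar's test on two sets of test predictions with mlxtend typically follows the pattern below (an illustrative toy example only; stat_test performs the split, vectorisation and prediction internally, so its implementation may differ)
import numpy as np
from mlxtend.evaluate import mcnemar_table, mcnemar

# toy example: true labels and two models' predictions on the same test set
y_true  = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
y_pred1 = np.array([1, 1, 2, 2, 2, 2, 3, 3, 1, 3])
y_pred2 = np.array([1, 2, 2, 2, 1, 2, 3, 3, 3, 3])

tb = mcnemar_table(y_target=y_true, y_model1=y_pred1, y_model2=y_pred2)  # 2x2 contingency table
chi2, p = mcnemar(ary=tb, corrected=True)
print('chi-squared: %.3f, p-value: %.3f' % (chi2, p))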
  • generate a plot of the logistic regression coefficients
  • pass in the classifier and vectorizer objects returned by run_model
  • only works with a logistic regression classifier
plot_lr_coef(classifier, vectorizer)
  • generate k-NN performance graph used in our report
  • values in this plot are hard-coded from our results
knn_plot()

Usage of deep_learning_models.py

Import all dependencies and function definitions in the file.

  • Store data in a pandas dataframe, 'df'
  • Pass in as a string the file location where data.csv has been saved.
df = load_data('/.../data.csv') 
  • Load the largest English model from spaCy
  • by default, spaCy limits inputs to 1,000,000 characters
  • the longest sample in our dataset has 2,871,868 characters, so we raise this limit
nlp = spacy.load('en_core_web_lg')
nlp.max_length = 2871868
  • Set the following options
  • if using word embeddings, MAX_LENGTH controls max number of words per sample
  • if using sentence embeddings, MAX_LENGTH controls max number of sentences per sample
MAX_LENGTH = 300 # the max length per sample (choose wisely)
NB_EPOCHS = 50 # number of epochs over which to run. choose some integer, ideally <100
EARLY_STOPPING = True
PATIENCE = 5 # patience for early stopping, if set to True. choose some integer < NB_EPOCHS
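  • for reference, EARLY_STOPPING and PATIENCE correspond to the standard Keras early-stopping mechanism, which is wired up roughly as follows (a sketch of the general pattern; run_model in deep_learning_models.py may configure it differently)
from keras.callbacks import EarlyStopping

callbacks = []
if EARLY_STOPPING:
    # stop training once validation loss has not improved for PATIENCE consecutive epochs
    callbacks.append(EarlyStopping(monitor='val_loss', patience=PATIENCE))
# the callback list is then passed to model.fit(..., epochs=NB_EPOCHS, callbacks=callbacks)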
  • we run the function that adds spaCy embeddings to our dataframe
  • the function takes in 3 parameters and returns the updated df
    1. df - the pandas dataframe containing our data
    2. embedding - the type of embedding (options: "word_embeddings", "sentence_embeddings")
    3. MAX_LENGTH - an integer between 1 and 300,000, but due to memory requirements ideally <=1,000. Note: setting MAX_LENGTH=1000 already loads ~75GB of embeddings into memory
df = get_embeddings(df, "word_embeddings", MAX_LENGTH)
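  • roughly speaking, the word-embedding option maps each sample to a (MAX_LENGTH, 300) matrix of spaCy vectors, which is why memory grows quickly with MAX_LENGTH; a simplified sketch of the idea (the actual get_embeddings function may pad, truncate or store the vectors differently)
import numpy as np

def embed_sample(text, nlp, max_length):
    # one 300-dimensional en_core_web_lg vector per token, truncated and zero-padded to max_length
    doc = nlp(text)
    vectors = [token.vector for token in doc[:max_length]]
    while len(vectors) < max_length:
        vectors.append(np.zeros(300, dtype='float32'))
    return np.stack(vectors)

# e.g. embed_sample(df['Cleaned'][0], nlp, MAX_LENGTH) has shape (MAX_LENGTH, 300)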
  • we run the neural network model; the run_model function takes in 6 parameters, including those defined above:
    1. df - the pandas dataframe containing our data
    2. architecture - the type of architecture (options: "mlp", "cnn", "ngram_cnn", "lstm", "bi_lstm")
    • note: "ngram_cnn" is based on the CNN implemented in (Kim, 2014), as discussed in our report
    3. MAX_LENGTH
    4. NB_EPOCHS
    5. EARLY_STOPPING
    6. PATIENCE
model, cm = run_model(df, "mlp", MAX_LENGTH, NB_EPOCHS, EARLY_STOPPING, PATIENCE)
  • plot confusion matrix (normalised or not), if desired
plot_confusion_matrix(cm, normalize=False, target_names=[i for i in range(1,21)])

Usage of han.py

We acknowledge that the implementation of the dot_product function and AttentionWithContext layer, and general design of the network, is based on this and that repository.

However, we adapted those existing implementations and configured the network for our own context and purposes.

Import all dependencies and function definitions in the file.

  • Store data in a pandas dataframe, 'df'
  • Pass in as a string the file location where data.csv has been saved.
df = load_data('/.../data.csv') 
  • Set the following parameters (adjust as desired)
MAX_WORDS = 261737  # number of unique tokens in our training set
MAX_SENTS = 300 # we stick to max. 300 sentences per sample
MAX_SENT_LENGTH = 300 # we stick to max. 300 tokens per sentence
VALIDATION_SPLIT = 0.2 # same training:validation split, 80:20
EMBEDDING_DIM = 300  # we stick to embedding dimensions of 300
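  • these parameters define the 3D input a HAN expects: each sample becomes a (MAX_SENTS, MAX_SENT_LENGTH) matrix of word indices, which the network embeds into EMBEDDING_DIM dimensions; a rough sketch of that shaping step (han.py's own preprocessing may differ in the details)
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(df['Cleaned'])

def shape_sample(sentences):
    # sentences: a list of sentence strings for one document
    seqs = pad_sequences(tokenizer.texts_to_sequences(sentences), maxlen=MAX_SENT_LENGTH)
    sample = np.zeros((MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
    sample[:min(len(seqs), MAX_SENTS)] = seqs[:MAX_SENTS]
    return sample  # shape (MAX_SENTS, MAX_SENT_LENGTH)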
  • run the model
model, cm = run_model(df, MAX_WORDS, MAX_SENTS, MAX_SENT_LENGTH, VALIDATION_SPLIT, EMBEDDING_DIM)
  • plot the confusion matrix
plot_confusion_matrix(cm, normalize=False, target_names=[i for i in range(1,21)], title='Confusion Matrix')

Usage of decision_boundary_visualisation.py

Run the entire Python script.

We used a list of 20 distinct colours from here.

  • It generates a plot of our k-NN classifier's decision boundaries.
  • The dimensions of our features have been reduced to 2D, so this is not representative of the actual classifiers we trained.
  • In the file there is a variable, vectorizer, which is set to TfidfVectorizer().
  • Change this to CountVectorizer() if desired.
# vectorizer = CountVectorizer()  
vectorizer = TfidfVectorizer() # choose vectorizer
  • There are also two variables, k=1 and weights='distance'.
  • k can be set to any desired value, and the corresponding plot will be generated.
  • weights can be set to 'distance' or 'uniform'.
k = 1 # choose value for k-NN  
# weights = 'uniform'  
weights = 'distance' # choose distance weighting  
classifier = KNeighborsClassifier(n_neighbors=k, weights=weights)
  • Run the entire Python script once these are set.
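  • for reference, the general recipe behind this kind of decision-boundary plot is sketched below (a simplified version assuming TruncatedSVD for the 2D reduction and a mesh grid for the regions; the script's own plotting code is more elaborate)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('/.../data.csv')  # path to the saved data.csv

# reduce TF-IDF features to 2D, then fit k-NN in that 2D space
X_2d = TruncatedSVD(n_components=2).fit_transform(TfidfVectorizer().fit_transform(df['Cleaned']))
y = df['Class']
clf = KNeighborsClassifier(n_neighbors=1, weights='distance').fit(X_2d, y)

# predict over a mesh grid covering the 2D space and draw the resulting regions
xx, yy = np.meshgrid(np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 200),
                     np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3, cmap='tab20')
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab20', s=5)
plt.show()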

Usage of extract.py

This file extracts the text samples from the raw HTML files of the original EURLEX dataset, which can be accessed here.

To run this file, first download and unzip all raw HTML files from here. Note the directory in which these files have been saved.

Create a new empty directory where the extracted text files will be stored.

  • Import dependencies
import os
from bs4 import BeautifulSoup
  • Initialise variables containing the filepath of the directory where all the raw HTML files are, and the directory where the extracted text files will be stored.
base_dir = "/..." # the location of raw HTML files
second_dir = "/..." # the location of extracted files

Run the rest of the script.
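
In outline, the rest of the script walks through base_dir, parses each HTML file with BeautifulSoup, and writes the extracted text into second_dir. A minimal sketch of that loop (extract.py itself targets the specific structure of the EUR-Lex pages, so its parsing logic is more involved):

for filename in os.listdir(base_dir):
    if not filename.endswith('.html'):  # assumes the raw files use a .html extension
        continue
    with open(os.path.join(base_dir, filename), encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    text = soup.get_text(separator=' ', strip=True)  # plain body text of the document
    with open(os.path.join(second_dir, filename.replace('.html', '.txt')), 'w', encoding='utf-8') as out:
        out.write(text)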

The output of this process is the body text of each file, with a CELEX ID. We then matched the CELEX IDs with the document IDs in this file.

At this stage, we have the body texts, with document IDs. We then matched the document IDs with the top level directory codes, which are the class labels in the file id2class_eurlex_DC_l1.qrels, from this zip file.

Note that the output of this process has been saved in the file data.csv, in a format which is easy to work with, especially as a pandas dataframe.

Acknowledgements

EURLEX dataset modified with permission.

Access the original dataset here.

See the original paper here.

Access the full EUR-Lex repository here.

EUR-Lex data is used and modified with permission and remains the property of:

'© European Union, https://eur-lex.europa.eu, 1998-2019'
