Skip to content

Fabio752/Decision_Trees

Repository files navigation

CO395 Introduction to Machine Learning: Coursework 1 (Decision Trees)

Introduction

This repository contains the skeleton code and dataset files that you need in order to complete the coursework.

Data

The data/ directory contains the datasets you need for the coursework.

The primary datasets are:

  • train_full.txt
  • train_sub.txt
  • train_noisy.txt
  • validation.txt

Some simpler datasets that you may use to help you with implementation or debugging:

  • toy.txt
  • simple1.txt
  • simple2.txt

The official test set is test.txt. Please use this dataset sparingly and purely to report the results of evaluation. Do not use this to optimise your classifier (use validation.txt for this instead).

Codes

  • classification.py

    • Contains the skeleton code for the DecisionTreeClassifier class. Your task is to implement the train() and predict() methods.
  • eval.py

    • Contains the skeleton code for the Evaluator class. Your task is to implement the confusion_matrix(), accuracy(), precision(), recall(), and f1_score() methods.
  • example_main.py

    • Contains an example of how the evaluation script on LabTS might use the classes and invoke the methods defined in classification.py and eval.py.

Instructions

  • The project contains some files for visualisation and data analysis purposes as well as the required ones.

    • Required files

      • classification.py:

        • ClassifierTreeStats: a class storing statistics of the ClassifierTree (node count, leaf count, etc.)

        • ClassifierTree: a class storing the decision tree with member functions:

          • __init__: initialise it by passing:
            • dataset: the dataset to classify.
            • splitObject: a splitObject object.
            • treeStats: a TreeStats object.
            • depth (optional, default = 0):
            • parent (optional, default = None):
          • buildTree: (no arguments taken), just builds the structure
          • predict: takes one argument, returns prediction:
            • attrib: one set of attributes to predict.
          • __repr__: takes one optional argument, returns text-based visualisation of the tree
            • maxDepth (optional, default = None): max depth for the visualisation
        • DecisionTreeClassifier: A class for the making a decision tree classifier object. Has an attribute is_trained: bool that keeps track of whether the classifier has been trained. It has also the following methods:

          • train: this method constructs the classifier from the data. It takes in arguments:
            • x: numpy.array (N x K) where N is the number of instances and K the number of attributes.
            • y: numpy.array (N x 1) storing the outcomes.
          • predict: this method takes one argument as input and predicts the outcomes from the given samples returning a (N x 1) numpy.array. It assumes the classifier has already been trained. It takes in an argument:
            • x: numpy.array (N x K) where N is the number of instances and K the number of attributes.
          • __repr__: takes one optional argument, returns a text-based visualisation of the tree
            • maxDepth (optional, default = None): max depth for the visualisation
      • dataset.py:

        • ClassifierDataset: a class that contains the dataset and has member functions to calculate best split for a given range of data.
          • initFromFile: a function that takes as input a path to a file (pathToFile) and reads the file.
          • initFromData: takes two parameters (attrib and labels) that allow it to instantiate the object from the given data.
          • Other functions for computing the best split while building the tree are included in the class.
      • eval.py:

        • Evaluator: this class has several methods that can be called to evaluate the classifier predicitions based on various metrics.
          • confusion_matrix: Computes the confusion matrix on the given classifier. Has the following parameters:
            • prediction: np.array containing the predicted class labels
            • annotation: np.array containing the ground truths
          • class_labels: np.array containing the ordered set of class labels. If not provided, default value will be the unique values in annotation.
          • accuracy: calculates accuracy given the confusion matrix.
          • precision: calculates precision given the confusion matrix.
          • recall: calculates recall given the confusion matrix.
          • f1_score: calculates f1_score given the confusion matrix.
      • prune.py:

        • Prune: a class prunes the decision tree upon initialisation.
          • __init__:
            • decisionTreeClassifier: an object representing the tree to prune
            • validationAttrib: the attributes for the validation set.
            • validationLabel: the labels for the validation set.
            • aggressive: boolean (optional, default = false). Pruning aggressively means prune even when the accuracy after pruning stays the same.
      • visualise.py:

        • TreeVisualiser: a class that, upon initialisation, plots a image-based visualisation of a tree.
          • __init__:
            • decisionTreeClassifier: the decision tree to print.
            • maxPlotDepth: int value to indicate the depth level on which to stop the printing (optional, default = None).
            • compact: boolean (optional, default = false) that enables compact mode.
            • filename: the name of the output file (optional, default = visualiser_output).
            • format: the format of the output file (supports svg, jpg, png or pdf). (optional, default = svg).
      • k_fold.py:

        • k_fold_validator: a class that performs k-fold cross-validation given k and a dataset
          • __init__:
            • dataset: the dataset, initialised using the path to the file.
            • k: the value of k.
          • split_dataset: this method splits the rows of the dataset into k different folds. It then generates two arrays (stored as member variables), one containing test indices and the other containing the corresponding train indices. This method takes no arguments.
          • perform_validation: this method uses the previously generated arrays to train and test k different decision tree models. The accuracy of each model is stored as an element of an array (member variable). The method returns the average accuracy score of the k different models, as well as the standard deviation from that average score. This method takes no arguments.
          • test_best_model: this method finds the model with the highest accuracy from the perform_validation function (by checking the saved accuracy scores) and tests it on the full test dataset. This method takes 1 argument:
            • test_path: path to the full test set
          • plot_confusion_matrix: uses matplotlib to plot a confusion matrix and save the figure. Takes 5 arguments:
            • cm: the confusion matrix to plot
            • target_names: names of the (ground truth) classes
            • title: for the title of the plot, as well as the name to save the figure by
            • cmap (optional, default = None): the colour map of the plot
            • normalize (optional, default = False): boolean indicating whether or not to normalize the values of the confusion matrix along its rows
    • Non-required files

      • __main_eval_test.py :

        • The purpose of this file is to generate the confusion matrix, accuracy and calculate macro average recall, precision and f1 for each training.set.
      • __main_prune.py :

        • The purpose of this file is to determine unpruned and pruned accuracy on input datasets, number of nodes pruned (as well as number of parent leaves) and decreasing in the tree's max depth.
      • __main_draw.py :

        • The purpose of this file is to generate a pdf file to visualise the tree (pruned or unpruned).
      • __profiler.py :

        • The purpose of this file is to generate a table to assess the execution time for training different datasets.
        • The number of test samples can be changed in the file.
      • __main_analysis.py:

        • The purpose of this file is to generate a table to assess the execution time for training different datasets.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages