sissythem/hate-speech-detection

Dev Instructions

Files

Configurations

  • twitter4j.properties: needed in order to download the tweets for one of the two datasets used (an example is given below this list)
  • log4j.properties: configures the logger
  • emailConfig-example.properties: rename this file to emailConfig.properties and define your own properties to be notified when the program finishes execution
  • config.properties (an example is given below this list):
    • parallel: whether to run cross-validation folds in parallel
    • numFolds: the number of folds used in cross validation
    • runs: used only for the crossValidation classificationType; defines how many times cross validation is executed
    • dataset: select -1 to include all texts and run the program as single-label supervised learning, or choose one of the two datasets (0 or 1) to run the program as multi-label supervised learning
    • instances: choose "new" to generate new instances or "existing" to use already extracted instances, which are read from an ARFF file
    • pathToInstances: instances have been created for the merged dataset and for each dataset separately, so define the folder from which the program retrieves the instances, e.g. "./instances/singlelabel/". Only this part of the path needs to be defined, since the rest is the same in all instances folders. This field works together with the instances field above.
    • datasource: choose whether the data (texts, features and texts_features) are read from the database or from CSV files
    • vectorFeatures: likewise, write "new" to regenerate the vector features or "existing" to read them from the database or the CSV files; in both cases you should first select "new" in the instances field above
    • graphFeatures: true/false, whether graph features are generated (true has no effect unless new instances are generated)
    • graphType: defines whether an n-gram or a word graph is used (requires graphFeatures set to true)
    • featuresKind: applies to vector features; select "all" or a specific kind (e.g. bow, ngrams, etc.)
    • instancesToFile: if new instances are generated, true/false defines whether the instances are exported to a file
    • The vector feature configurations below are used only if "new" is selected in the vectorFeatures field:
      • preprocess: true/false, whether the texts are preprocessed
      • stopwords: if preprocessing is enabled, whether stopwords are also removed
      • bow: whether to generate bow features
      • word2vec: whether to generate word2vec features
      • aggregationType: the aggregation type for word2vec features (requires word2vec set to true)
      • charngram: whether to generate charngram features
      • ngram: whether to generate ngram features
      • spelling: whether to generate spelling features
      • syntax: whether to generate syntax features
    • classificationType: select either "classification" or "crossValidation"
    • Classifiers configuration: define which classifiers run by setting true/false in the NaiveBayes, LogisticRegression and KNN fields
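
A minimal twitter4j.properties sketch. These are the standard Twitter4J OAuth property keys; the placeholder values must be replaced with credentials from your own Twitter developer account:

    debug=false
    oauth.consumerKey=YOUR_CONSUMER_KEY
    oauth.consumerSecret=YOUR_CONSUMER_SECRET
    oauth.accessToken=YOUR_ACCESS_TOKEN
    oauth.accessTokenSecret=YOUR_ACCESS_TOKEN_SECRET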
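
A config.properties sketch covering the fields above. The values are illustrative rather than shipped defaults, and the exact tokens accepted for datasource, graphType and aggregationType are assumptions:

    # execution
    classificationType=crossValidation
    parallel=true
    numFolds=10
    runs=1
    # data selection (-1 = merged, single label; 0 or 1 = one dataset, multi label)
    dataset=-1
    datasource=csv
    instances=new
    pathToInstances=./instances/singlelabel/
    instancesToFile=true
    # features (the options below apply because vectorFeatures=new)
    vectorFeatures=new
    graphFeatures=true
    graphType=ngram
    featuresKind=all
    preprocess=true
    stopwords=true
    bow=true
    word2vec=true
    aggregationType=avg
    charngram=true
    ngram=true
    spelling=true
    syntax=true
    # classifiers
    NaiveBayes=true
    LogisticRegression=true
    KNN=false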

Datasets

  • Single Label:
    • HateSpeech: 24463
    • Clean: 14548
    • Total: 39011
  • Multi Label (Racism, Sexism, Clean):
    • Racism: 1910
    • Sexism: 3035
    • Clean: 10543
    • Total: 15488
  • Multi Label (HateSpeech, OffensiveLanguage, Clean):
    • HateSpeech: 1392
    • OffensiveLanguage: 18126
    • Clean: 4005
    • Total: 23523

Classification

  • Problem with KNN: tested for training in the Weka GUI; with k=3 training is quick, while with k=9 it takes considerably longer (a sketch for reproducing this is given below)
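
A minimal sketch for reproducing the timing difference outside the GUI, using Weka's IBk (its KNN implementation); the ARFF path is a placeholder:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class KnnTiming {
        public static void main(String[] args) throws Exception {
            // placeholder path: any of the exported ARFF instances files
            Instances data = DataSource.read("./instances/singlelabel/instances.arff");
            data.setClassIndex(data.numAttributes() - 1);

            for (int k : new int[]{3, 9}) {
                IBk knn = new IBk(k); // k nearest neighbours
                Evaluation eval = new Evaluation(data);
                long start = System.currentTimeMillis();
                // IBk is lazy, so the cost shows up during evaluation rather than model building
                eval.crossValidateModel(knn, data, 10, new Random(1));
                System.out.println("k=" + k + ": " + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }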

Metrics

  • F-Measure: calculated using the weightedFMeasure function, which computes the class-weighted average F-Measure.
  • Kappa: a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). It is used both to evaluate a single classifier and to compare classifiers with each other. Because it takes random chance into account, it is generally less misleading than plain accuracy: an Observed Accuracy of 80% is a lot less impressive against an Expected Accuracy of 75% than against one of 50%. Observed Accuracy is simply the fraction of instances classified correctly across the entire confusion matrix; Expected Accuracy is the accuracy any random classifier would be expected to achieve based on the confusion matrix, and it is directly related to the number of instances of each class. A sketch for reading both metrics, with Kappa worked through on these numbers, follows.
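
Both metrics are read off Weka's Evaluation object; a minimal sketch, assuming an Evaluation produced as in the KNN sketch above:

    import weka.classifiers.Evaluation;

    public class Metrics {
        // report the two metrics from a finished Weka Evaluation
        static void report(Evaluation eval) {
            System.out.println("F-Measure: " + eval.weightedFMeasure()); // class-weighted average F-Measure
            System.out.println("Kappa:     " + eval.kappa());            // (observed - expected) / (1 - expected)
        }
    }

Kappa worked through by hand on the accuracies quoted above: (0.80 - 0.75) / (1 - 0.75) = 0.20 against 75% expected accuracy, versus (0.80 - 0.50) / (1 - 0.50) = 0.60 against 50% expected accuracy; the same observed accuracy scores three times higher against chance.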
