WikiSentiment

Automatic categorization of user interactions in Wikipedia.

Homepage

http://github.com/whym/wikisentiment

Contact

http://whym.org

Overview

Preprocessing and training:

  1. For each entry:
    • Extract raw features and store them in MongoDB:

      {
        "entry": {
          "rev_id":   2894772,
          "title": "Yosri",
          "content": {
            "added": [ "Hi This is ...." ],
            "removed": []
          },
          "comment": "Hi This is ....",
          "timestamp": "...",
          "sender": {},
          "receiver": {}
        },
        "labels": {
           "debate":  false,
           "other":   false,
           "template": true,
           "welcome":  true,
           "suggest":  true,
           "invite":  false,
           "minor":   false,
           "vandal":  false
        },
        "features": {
          "ngram":   {"type": "assoc", "values": {...}},
          "SentiWN": {"type": "assoc", "values": {...}},
          ...
        },
        "vector": {
          "1": true,
          "2": true,
          "101": true,
          ...
        },
        ...
      }
  2. Convert the raw features into vectors and update all entries in MongoDB. (A different selection of features and/or hash kernels may be used here.)
  3. For each entry, add it to the training set.
  4. Train a classifier with the training set.
  5. Output the resulting model.
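
The conversion in step 2 can be sketched as below. This is a minimal illustration under stated assumptions, not the project's actual code: `extract_ngrams`, `hash_features`, and `num_buckets` are hypothetical names, and `zlib.crc32` stands in for the murmur hash so that the example needs only the standard library. Indices start at 1, which is what liblinear expects, and they are kept as strings to match the "vector" field shown above.

```python
import zlib


def extract_ngrams(text, n=2):
    """Count word n-grams in a message (a simple stand-in for the 'ngram' raw feature)."""
    tokens = text.lower().split()
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return counts


def hash_features(features, num_buckets=2 ** 18):
    """Hash every raw feature key into a sparse {index: True} vector."""
    vector = {}
    for group, spec in sorted(features.items()):
        for key in spec["values"]:
            # Prefix the key with its group name so that equal keys from
            # different feature groups hash to different buckets.
            data = f"{group}:{key}".encode("utf-8")
            # Assumption: crc32 replaces the murmur hash used by the project.
            index = zlib.crc32(data) % num_buckets + 1  # 1-based index
            vector[str(index)] = True
    return vector


features = {
    "ngram": {"type": "assoc", "values": extract_ngrams("Hi This is a welcome message")},
}
vector = hash_features(features, num_buckets=1000)
print(sorted(vector, key=int))
```

A smaller `num_buckets` gives shorter vectors at the cost of more collisions between unrelated features.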

Testing:

  1. Load the model and construct a classifier.
  2. For each entry, output it and the label predicted by the classifier.
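
The two testing steps could look roughly like the sketch below. Everything in it is an assumption made for illustration: the JSON model format, the function names `load_model`, `predict`, and `label_entries`, and the plain linear scoring rule, which stands in for a classifier trained with liblinear.

```python
import json


def load_model(text):
    """Parse a hypothetical JSON model of the form {"bias": b, "weights": {index: w}}."""
    model = json.loads(text)
    weights = {str(index): float(w) for index, w in model["weights"].items()}
    return float(model["bias"]), weights


def predict(model, vector):
    """Return True when the linear score of the active features is positive."""
    bias, weights = model
    score = bias + sum(weights.get(index, 0.0) for index, on in vector.items() if on)
    return score > 0


def label_entries(model, entries):
    """Yield (rev_id, predicted label) for each entry."""
    for item in entries:
        yield item["entry"]["rev_id"], predict(model, item["vector"])


model = load_model('{"bias": -0.5, "weights": {"1": 1.5, "2": -2.0}}')
entries = [
    {"entry": {"rev_id": 2894772}, "vector": {"1": True}},
    {"entry": {"rev_id": 2894773}, "vector": {"1": True, "2": True}},
]
predictions = list(label_entries(model, entries))
for rev_id, predicted in predictions:
    print(rev_id, predicted)
```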

Usage

  1. Obtain a list of revision IDs or a list of actual messages as CSV.
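
Reading such a list might look like the following sketch. The CSV layout is not specified here, so the example assumes one revision ID in the first column and an optional header row; `read_revision_ids` is a hypothetical helper name.

```python
import csv
import io


def read_revision_ids(stream):
    """Collect integer revision IDs from the first column, skipping non-numeric rows."""
    ids = []
    for row in csv.reader(stream):
        if not row:
            continue  # blank line
        cell = row[0].strip()
        if cell.isdigit():  # also skips a header such as "rev_id"
            ids.append(int(cell))
    return ids


sample = io.StringIO("rev_id\n2894772\n2894773\n")
rev_ids = read_revision_ids(sample)
print(rev_ids)
```

With a real file, pass `open(path, newline="")` instead of the `io.StringIO` sample.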

Requirements

The following Python modules are required:

  • urllib2
  • pymongo
  • nltk (wordnet)
  • murmur
  • liblinear, liblinearutil

Todo

  • Support exporting and importing models
  • Pipeline Wikipedia API calls, feature extraction, and database inserts efficiently in a producer-consumer style
  • Add a visualization script for error analysis
  • Support other languages
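
For the producer-consumer item, one possible shape is sketched below using only the standard library. It is not part of the project: `stage` and `run_pipeline` are hypothetical names, the `fetch`, `extract`, and `insert` callables are placeholders for the Wikipedia API call, feature extraction, and MongoDB insert, and error handling is left out.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(func, inbox, outbox=None):
    """Apply func to each item from inbox; forward results to outbox if there is one."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            if outbox is not None:
                outbox.put(SENTINEL)  # tell the next stage to stop too
            return
        result = func(item)
        if outbox is not None:
            outbox.put(result)


def run_pipeline(rev_ids, fetch, extract, insert, maxsize=100):
    """Run fetch -> extract -> insert in three threads joined by bounded queues."""
    q_ids, q_raw, q_feat = (queue.Queue(maxsize) for _ in range(3))
    threads = [
        threading.Thread(target=stage, args=(fetch, q_ids, q_raw)),
        threading.Thread(target=stage, args=(extract, q_raw, q_feat)),
        threading.Thread(target=stage, args=(insert, q_feat)),
    ]
    for thread in threads:
        thread.start()
    for rev_id in rev_ids:
        q_ids.put(rev_id)  # blocks when the first stage falls behind
    q_ids.put(SENTINEL)
    for thread in threads:
        thread.join()


stored = []  # stands in for the database collection
run_pipeline(
    [1, 2, 3],
    fetch=lambda rev_id: {"rev_id": rev_id, "text": f"message {rev_id}"},
    extract=lambda item: {**item, "length": len(item["text"])},
    insert=stored.append,
)
print(stored)
```

The bounded queues keep a fast stage from running far ahead of a slow one, and one thread per stage preserves the input order.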

See also