Different techniques to measure the quality of Wikipedia

singla-excelsior/wikipedia_analysis

 
 

wikipedia_analysis

This repository contains my code for predicting the quality class of Wikipedia articles.

The code is in the R file in the analysis directory.

Data set

The data is stored in the all_data.tsv file.

The data set contains information on about 20,000 Wikipedia articles, collected through Wikipedia projects.
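As a minimal sketch of how the file could be inspected before running any model, the snippet below reads a tab-separated file into a data frame. The helper name load_articles is hypothetical and is not part of AnalyzeData.R. The snippet also assumes that all_data.tsv has a header row and sits in the working directory; neither is stated above.

```r
# Hypothetical helper (not part of AnalyzeData.R): read a TSV file.
# quote = "" keeps quotes and apostrophes in article titles as literal text.
load_articles <- function(path) {
  read.delim(path, header = TRUE, sep = "\t", quote = "",
             stringsAsFactors = FALSE)
}

# Only attempt the read when the file is present in the working directory.
if (file.exists("all_data.tsv")) {
  articles <- load_articles("all_data.tsv")
  str(articles)         # column names and types
  print(nrow(articles)) # number of articles
}
```

Checking the column names this way first makes it easier to follow what the analysis functions expect.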

Running the code

You need to have R installed. RStudio is a convenient IDE for this, but it is optional.

Please note that the code was tested with R 3.2.3.

The following packages are required:

  • caTools
  • rpart
  • class
  • h2o

First, load the code:

setwd("path to the directory containing AnalyzeData.R")
source("AnalyzeData.R")

Then you can run the following analyses.

Linear regression

Linear regression is run by calling the function runRegression.

CART

The CART model is run by calling the function runCART.
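For readers new to CART, the sketch below fits a classification tree with rpart on synthetic data. It is not the body of runCART. The column names article_length and n_references and the class labels low, medium and high are made up for illustration.

```r
library(rpart)

set.seed(1)

# Synthetic stand-in for article features (hypothetical columns and labels).
toy <- data.frame(
  article_length = c(rnorm(50, 500, 100), rnorm(50, 3000, 500),
                     rnorm(50, 9000, 1000)),
  n_references   = c(rpois(50, 1), rpois(50, 10), rpois(50, 40)),
  quality        = factor(rep(c("low", "medium", "high"), each = 50))
)

# Fit a classification tree and predict the class of each row.
fit  <- rpart(quality ~ article_length + n_references,
              data = toy, method = "class")
pred <- predict(fit, newdata = toy, type = "class")

# Training accuracy and confusion matrix (optimistic: no held-out data here).
accuracy <- mean(pred == toy$quality)
print(table(predicted = pred, actual = toy$quality))
```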

kNN

The function for the kNN model is runKNNModel.
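As a rough illustration of kNN classification with the class package, the sketch below trains on synthetic data and predicts a held-out part. It is an assumption-laden example and not the implementation of runKNNModel. The feature names, class labels and the choice of k = 5 are hypothetical.

```r
library(class)

set.seed(1)

# Synthetic stand-in for article features (hypothetical columns and labels).
toy <- data.frame(
  article_length = c(rnorm(50, 500, 100), rnorm(50, 3000, 500),
                     rnorm(50, 9000, 1000)),
  n_references   = c(rpois(50, 1), rpois(50, 10), rpois(50, 40)),
  quality        = factor(rep(c("low", "medium", "high"), each = 50))
)
cols <- c("article_length", "n_references")

# Hold out 50 rows for testing.
idx <- sample(nrow(toy), 100)

# kNN is distance based, so scale the features using training statistics only.
train_x <- scale(toy[idx, cols])
test_x  <- scale(toy[-idx, cols],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred <- knn(train = train_x, test = test_x, cl = toy$quality[idx], k = 5)
accuracy <- mean(pred == toy$quality[-idx])
```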

Multinomial logistic regression

The multinomial logistic regression predictor is run by calling the function runMultinominalLogisticRegression.

This function requires the packages caret and nnet.

SVM

Packages required: caret and e1071

Function name: runSVM

Random forest

We provide two functions for the randomForest model.

The first function is runRFModel, which loads the data and runs the model with readability scores, using k-fold cross-validation (with k = 5).

The second function is runRFModel_withoutReadabilityScore, which runs the model without readability scores, as in [1].

We applied 5-fold cross-validation.

The first function should give better predictions.
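The 5-fold cross-validation mentioned above can be sketched in base R as follows. A linear model on synthetic data stands in for the random forest here, so the numbers say nothing about the real data set. The variable names are illustrative.

```r
set.seed(42)

k <- 5
n <- 100

# Synthetic data: y depends linearly on x plus noise.
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.5)

# Randomly assign each row to one of k folds of equal size.
fold <- sample(rep(seq_len(k), length.out = n))

# Train on k - 1 folds, evaluate on the held-out fold, repeat for every fold.
fold_rmse <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- lm(y ~ x, data = train)  # stand-in model, not a random forest
  sqrt(mean((predict(fit, newdata = test) - test$y)^2))
})

# Average error over the folds.
cv_rmse <- mean(fold_rmse)
```

Each row is used for testing exactly once, which is what makes the averaged error a fairer estimate than a single train/test split.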

Utilities

We also provide some utility functions, for example to calculate RMSE and NDCG.
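For reference, the two metrics can be written in a few lines of base R. This is a sketch, and the function names rmse, dcg and ndcg are chosen here for illustration, so they may not match the names in AnalyzeData.R. The DCG below uses the exponential gain form, (2^rel - 1) / log2(rank + 1), which is one of several common variants.

```r
# Root mean squared error between actual and predicted numeric values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Discounted cumulative gain of relevance values given in ranked order.
dcg <- function(rel) {
  sum((2^rel - 1) / log2(seq_along(rel) + 1))
}

# Normalized DCG: rank items by predicted score, compare to the ideal order.
ndcg <- function(relevance, scores) {
  ranked    <- relevance[order(scores, decreasing = TRUE)]
  ideal_dcg <- dcg(sort(relevance, decreasing = TRUE))
  if (ideal_dcg == 0) return(0)  # no relevant items at all
  dcg(ranked) / ideal_dcg
}

# Example: a ranking that matches the true relevance order scores 1.
perfect <- ndcg(relevance = c(3, 2, 1), scores = c(0.9, 0.5, 0.1))
```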

[1] Warncke-Wang, M., Ayukaev, V.R., Hecht, B. and Terveen, L.G., 2015, February. The Success and Failure of Quality Improvement Projects in Peer Production Communities. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (pp. 743-756). ACM.
