Hao Shu, Gengyi Sun, Han Zhou
In this project, we investigated approaches to natural language processing and text classification. A data set of 100,000 comments extracted from Reddit was used to train the models and validate their accuracy. Python classes were used for feature extraction and model building, with unigram bag-of-words counts extracted from the corpus as features. After evaluating various classifiers, we constructed an ensemble of classifiers that can classify comments drawn from a limited set of subreddits. Prediction accuracy stays above 58.7% with no additional data required, and we placed 9th among 98 teams in the Kaggle competition. We also recorded which classifiers performed particularly well on the matrices produced from natural language.
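To make the feature pattern concrete, the following is a minimal sketch of unigram bag-of-words extraction and of one simple way an ensemble can combine its members' outputs. It is illustrative only: the helper names (`tokenize`, `build_vocabulary`, `bag_of_words`, `majority_vote`) are hypothetical, the tokenization rule is an assumption, and hard majority voting is assumed here because the summary does not say how the ensemble combines its classifiers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a comment and split it into unigram tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(comments):
    """Map each distinct unigram in the corpus to a column index."""
    vocab = {}
    for comment in comments:
        for token in tokenize(comment):
            # Assign the next free column the first time a token is seen.
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(comment, vocab):
    """Return the unigram count vector for one comment.

    Tokens that are not in the vocabulary are ignored.
    """
    vector = [0] * len(vocab)
    for token, count in Counter(tokenize(comment)).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


def majority_vote(predictions):
    """Combine the labels predicted by several classifiers by hard voting.

    This is an assumed combination rule, not necessarily the one used
    in the project.
    """
    return Counter(predictions).most_common(1)[0][0]
```

Stacking the vectors returned by `bag_of_words` for every comment gives the document-term matrix that the classifiers are trained on; `majority_vote` would then be applied to the per-comment labels produced by each trained classifier.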
Report available: here