This group project was implemented as part of the Data Mining Techniques course offered by the M.Sc. in Data Science of the Department of Informatics, School of Sciences, National and Kapodistrian University of Athens.
This repository addresses the following tasks:
- Classification of articles into their respective categories
- Duplicate detection
- Word cloud creation for each category
The dataset can be found in train_set.zip. It contains 12265 articles, each with a title, its content, and the category to which it belongs.
To apply the classifiers (SVM and Random Forest), the articles were vectorized with one of three methods at a time: the TF-IDF vectorizer, Singular Value Decomposition (SVD), or the average word vector (Word2Vec). Stop words were removed and the Snowball stemmer was applied. Each classifier was run on each vectorization.
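The combination of vectorizations and classifiers described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: the toy articles, the category names, the pipeline names, the linear SVM kernel, and the choice to apply SVD on top of TF-IDF features are all assumptions. Snowball stemming and the Word2Vec averaging are left out to keep the example dependent on scikit-learn only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy articles standing in for the contents of train_set.zip.
texts = [
    "The team won the football match after a late goal",
    "The striker scored twice in the cup final",
    "The coach praised the players after the match",
    "Stock markets fell as investors feared inflation",
    "The central bank raised interest rates again",
    "Shares in the bank rose after strong profits",
]
# Hypothetical category names.
labels = ["Football", "Football", "Football",
          "Business", "Business", "Business"]


def build_pipelines(n_components=2, seed=0):
    """Return one pipeline per (vectorization, classifier) pair."""

    def tfidf():
        # Built-in English stop-word list; stemming is omitted in this sketch.
        return TfidfVectorizer(stop_words="english")

    def svd():
        # Assumption: SVD reduces the TF-IDF matrix to a few dense components.
        return TruncatedSVD(n_components=n_components, random_state=seed)

    def forest():
        return RandomForestClassifier(n_estimators=50, random_state=seed)

    def svm():
        # Assumption: a linear kernel; the kernel used in the project is not stated.
        return SVC(kernel="linear")

    return {
        "tfidf+svm": make_pipeline(tfidf(), svm()),
        "tfidf+rf": make_pipeline(tfidf(), forest()),
        "svd+svm": make_pipeline(tfidf(), svd(), svm()),
        "svd+rf": make_pipeline(tfidf(), svd(), forest()),
    }


pipelines = build_pipelines()
new_article = ["The goalkeeper saved a penalty in the match"]
predictions = {}
for name, pipeline in pipelines.items():
    pipeline.fit(texts, labels)
    predictions[name] = pipeline.predict(new_article)[0]

for name, category in predictions.items():
    print(f"{name}: {category}")
```

On the real dataset, the number of SVD components would need to be far larger than 2, and each pipeline would normally be scored with cross-validation rather than on a single example.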