Data-Mining-Techniques-Project

Problem Definition

This group project was implemented for the Data Mining Techniques course of the M.Sc. in Data Science, Department of Informatics, School of Sciences, National and Kapodistrian University of Athens.

This repository addresses three tasks:

Classification of articles into their respective categories
Duplicate detection
Word cloud creation for each category
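
The README does not say how duplicates are detected. As a minimal sketch of one common approach, the snippet below compares articles pairwise by the cosine similarity of their term-count vectors. The function names, the sample articles, and the 0.7 threshold are all illustrative assumptions, not taken from duplicates.py.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(articles, threshold=0.7):
    """Return (i, j, similarity) for every pair at or above the threshold.

    The threshold is a hypothetical value chosen for this example.
    """
    vectors = [term_counts(text) for text in articles]
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        if sim >= threshold:
            pairs.append((i, j, sim))
    return pairs


# Made-up sample articles: the first two use the same words in a different order.
articles = [
    "The central bank raised interest rates today.",
    "Today the central bank raised interest rates.",
    "The football team won the championship final.",
]
duplicate_pairs = find_duplicates(articles)
```

Comparing every pair is quadratic in the number of articles, so on a dataset of this size a vectorized similarity computation would likely be preferable to the plain loop shown here.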

Dataset

The dataset is provided in train_set.rar. It contains 12,265 articles, each with a title, its content, and the category it belongs to.

Processing & Classifiers

To apply the classifiers (SVM and Random Forest), the articles were vectorized with one of three methods: TF-IDF (TfidfVectorizer), Singular Value Decomposition (SVD), or average word vectors (Word2Vec). Stop words were removed and the Snowball stemmer was applied. Each classifier was trained on each of the three vectorizations.
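
To make the TF-IDF step concrete, here is a small sketch that removes stop words and computes L2-normalized TF-IDF weights by hand. It uses the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, which is the default in scikit-learn's TfidfVectorizer. The stop-word set, the function names, and the sample documents are assumptions made for illustration; stemming is omitted, and this is not the code in classification_all.py.

```python
import math
import re
from collections import Counter

# Illustrative subset only; a real run would use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Made-up sample documents.
documents = [
    "The market fell in early trading",
    "The market rose",
    "The team won the match",
]
vectors = tfidf_vectors(documents)
```

In the first document, "market" receives a lower weight than "fell" because it also appears in the second document. Vectors of this kind can then be passed to a classifier directly or reduced with SVD first.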
