
wsgan001/datamining-1

 
 


Data mining project

Repository for a data mining project carried out at ITU. It contains a pipeline of scripts for scraping and cleaning website metadata and for fetching Alexa website statistics. The cleaned data can be used directly with RapidMiner.
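As a rough illustration of the scraping and cleaning step, the sketch below pulls a few attributes out of a page's HTML using only the Python standard library. It is not the project's actual code: the names `MetaTagParser` and `extract_attributes`, and the choice of attributes (title, description, keyword count, viewport tag), are assumptions made for this example and are not necessarily among the attributes the pipeline extracts.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Hypothetical helper: collects <title> text and <meta name=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attrs.get("name")
            if name:
                # Normalize the name so "Description" and "description" match.
                self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_attributes(html):
    """Return a flat dict of cleaned attributes for one page (illustrative set)."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    # Drop empty entries such as the gap in "a, b,,c".
    keywords = [k.strip() for k in parser.meta.get("keywords", "").split(",") if k.strip()]
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keyword_count": len(keywords),
        "has_viewport": "viewport" in parser.meta,
    }
```

A flat dictionary per site maps naturally onto one row of a table, so a list of such dictionaries could be written out with `csv.DictWriter` and imported into RapidMiner.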

Abstract

A small system that collects data from 2425 websites and extracts a total of 44 attributes from each site. We show that a substantial amount of general statistics can be derived and that meaningful clusters can be found in the data. Furthermore, we provide prediction results, which suggest that the PageRank of a website is unlikely to be predictable from its intrinsic data alone. Being able to find patterns and statistics in the diverse landscape of the web is of interest to online businesses and to web statistics in general. With future work, the presented data mining process could lead to the display of unique web statistics or the creation of new tools for Business Intelligence.

Links

Related paper (pdf)

About

Data mining repo for the statistico project


Languages

  • TeX 70.2%
  • Python 29.8%