This repository contains an implementation of a Spark pipeline that clusters the URLs of any website. The stages of the pipeline and the Spark job file for each stage are as follows (a simplified sketch of the preprocessing stages follows the list):
- Splitting:
url_clusterer/job/preprocessing/url_splitter_job.py
- Counting words:
url_clusterer/job/preprocessing/url_word_count_job.py
- Applying Word2vec:
url_clusterer/job/preprocessing/word2vec_job.py
- Finding clusters by searching for the best hyperparameters (using K-means or Bisecting K-means):
  - K-means:
    url_clusterer/job/clustering/kmeans_optimization_job.py
  - Bisecting K-means:
    url_clusterer/job/clustering/bisecting_kmeans_optimization_job.py
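The preprocessing stages roughly correspond to the PySpark sketch below. It is an illustration only: the actual jobs read their inputs, outputs, and parameters from config.yml, and the column names, the tokenization rule, and the Word2vec settings shown here are assumptions rather than the repository's real schema.

```python
# Illustration only: a rough sketch of the preprocessing stages. The real jobs
# read inputs, outputs, and parameters from config.yml; the "url"/"words"
# columns, the tokenization rule, and the Word2vec settings are assumptions.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("url-preprocessing-sketch").getOrCreate()

urls = spark.createDataFrame(
    [("https://example.com/shop/shoes/red-sneakers",),
     ("https://example.com/blog/2020/summer-sale",)],
    ["url"],
)

# Splitting: break each URL into lowercase tokens on non-alphanumeric characters.
tokenized = urls.withColumn("words", F.split(F.lower(F.col("url")), "[^a-z0-9]+"))

# Counting words: a corpus-wide count of the resulting tokens.
word_counts = (
    tokenized.select(F.explode("words").alias("word"))
    .where(F.col("word") != "")
    .groupBy("word")
    .count()
)

# Word2vec: learn a fixed-size vector per URL from its tokens.
w2v = Word2Vec(vectorSize=50, minCount=1, inputCol="words", outputCol="features")
url_vectors = w2v.fit(tokenized).transform(tokenized)
```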
To evaluate the clustering result against a ground truth, url_clusterer/job/evaluation/cluster_evaluator_job.py can be used. To visualize the resulting clustering, url_clusterer/job/visualization/visualization_job.py can be used.
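As an illustration of what such an evaluation could look like, the sketch below compares predicted cluster ids with ground-truth labels using scikit-learn (a listed dependency). The file names, column names, and metrics are assumptions; the actual cluster_evaluator_job.py may read different inputs and report different scores.

```python
# Hypothetical sketch of evaluating predicted cluster ids against a ground
# truth labelling; file paths and column names below are assumptions.
import pandas as pd
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

predicted = pd.read_csv("clusters.csv")        # assumed columns: url, cluster_id
ground_truth = pd.read_csv("ground_truth.csv") # assumed columns: url, label

joined = predicted.merge(ground_truth, on="url")
print("ARI:", adjusted_rand_score(joined["label"], joined["cluster_id"]))
print("NMI:", normalized_mutual_info_score(joined["label"], joined["cluster_id"]))
```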
Input and output data locations, as well as some parameters of the pipeline's algorithms, can be configured for these Spark jobs in config.yml. example_config.yml is an example configuration file.
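For illustration, a job might read these settings with PyYAML (also a listed dependency) roughly as follows; the key names below are hypothetical, and the real structure is given in example_config.yml.

```python
# Illustration only: loading the pipeline configuration with PyYAML.
# The key names are hypothetical; see example_config.yml for the actual layout.
import yaml

with open("config.yml") as f:
    config = yaml.safe_load(f)

input_path = config["input"]["urls_path"]        # hypothetical key
output_path = config["output"]["clusters_path"]  # hypothetical key
vector_size = config["word2vec"]["vector_size"]  # hypothetical key
```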
The required Python packages can be installed with:

    pip3 install pypandoc pyspark==2.4.6 numpy pandas scikit-learn bokeh hyperopt PyYAML
To use the pipeline, the attributes in the config.yml file must be set. Then the splitting, word-count, Word2vec, and one of the optimization jobs (K-means or Bisecting K-means) must be run, in that order. The resulting cluster id for each URL is written to the file-system location specified in config.yml.
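The sketch below illustrates the final clustering step under simplified assumptions: it fits K-means for a few candidate values of k on stand-in Word2vec vectors and keeps the model with the best silhouette score. The actual optimization jobs read their data locations and search ranges from config.yml, and the hyperopt dependency suggests a more elaborate hyperparameter search, so this loop is only a stand-in.

```python
# Simplified stand-in for the K-means optimization job: try several k values
# and keep the model with the best silhouette score. The feature vectors below
# replace the Word2vec output; the real job reads them from the configured path.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("url-clustering-sketch").getOrCreate()

# Stand-in for the Word2vec output: one feature vector per URL.
url_vectors = spark.createDataFrame(
    [("https://example.com/shop/shoes", Vectors.dense([0.10, 0.20])),
     ("https://example.com/shop/hats",  Vectors.dense([0.12, 0.21])),
     ("https://example.com/shop/bags",  Vectors.dense([0.11, 0.19])),
     ("https://example.com/blog/news",  Vectors.dense([0.90, 0.80])),
     ("https://example.com/blog/sale",  Vectors.dense([0.88, 0.79])),
     ("https://example.com/blog/team",  Vectors.dense([0.91, 0.81]))],
    ["url", "features"],
)

evaluator = ClusteringEvaluator(featuresCol="features")  # silhouette by default

best_k, best_score, best_model = None, float("-inf"), None
for k in range(2, 5):
    model = KMeans(k=k, featuresCol="features", seed=42).fit(url_vectors)
    score = evaluator.evaluate(model.transform(url_vectors))
    if score > best_score:
        best_k, best_score, best_model = k, score, model

# One cluster id per URL, as the pipeline writes to the configured output path.
best_model.transform(url_vectors).select("url", "prediction").show()
```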
See the white paper repository for details about the project.