
Community Clusters in Web Graphs using PySpark

Contains a bootstrap ETL that generates a Parquet file of link records from a single WARC file of Common Crawl web data.
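
A minimal sketch of that idea (the real ETL lives in bootstrap/; the warcio package, the helper names, and the file name sample.warc.gz are assumptions here, not the repository's actual code): parse HTML responses in the WARC file, extract parent → child links, and write them to Parquet with the columns shown in the example further below.

import re
from urllib.parse import urlparse

from pyspark.sql import SparkSession
from warcio.archiveiterator import ArchiveIterator

LINK_RE = re.compile(rb'href="(https?://[^"]+)"')

def extract_links(warc_path):
    """Yield (parent, child) URL pairs from HTML responses in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            parent = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            for match in LINK_RE.finditer(body):
                yield parent, match.group(1).decode("utf-8", "ignore")

def host(url):
    """Hostname of a URL, e.g. 'twitter.com'."""
    return urlparse(url).netloc

def second_level(hostname):
    """Second-level label, e.g. 'twitter' for 'twitter.com' (sketch only)."""
    parts = hostname.split(".")
    return parts[-2] if len(parts) >= 2 else hostname

spark = SparkSession.builder.appName("bootstrap-etl").getOrCreate()

# For a sketch, the link pairs from a single WARC file fit in the driver.
rows = [
    (p, host(p), host(c), c, second_level(host(c)), second_level(host(p)))
    for p, c in extract_links("sample.warc.gz")
]
df = spark.createDataFrame(
    rows, ["parent", "parentTLD", "childTLD", "child", "childDomain", "parentDomain"]
)
df.write.parquet("./bootstrap/spark-warehouse/links")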

Install modules

All code in this repository requires Python 3.6+ and PySpark 2.3+. First, install the dependencies:

pip3 install -r requirements.txt

Directory information

  • bootstrap: Contains the ETL code and the condensed Parquet files with URL information
  • data: Contains results of graph analysis
  • public: Contains D3 visualizations of web graph clusters

ETL, Analysis and Visualization

Use the commands in RUNNING.md to run the ETL and analysis and to generate the results that are visualized in D3.

Example DataFrame created by reading the Parquet files:

>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> df = sqlContext.read.parquet("./bootstrap/spark-warehouse/<your-directory>*")
>>> df.show(5)

+--------------------+--------------------+-----------+--------------------+-----------+------------+
|              parent|           parentTLD|   childTLD|               child|childDomain|parentDomain|
+--------------------+--------------------+-----------+--------------------+-----------+------------+
|http://1separable...|1separable-43v3r....|twitter.com|http://twitter.co...|    twitter|     skyrock|
|      http://3msk.ru|             3msk.ru|    k--k.ru|http://k--k.ru/85...|       k--k|        3msk|
|      http://3msk.ru|             3msk.ru|    com9.ru|http://com9.ru/85...|       com9|        3msk|
|      http://3msk.ru|             3msk.ru|    com9.ru|http://com9.ru/85...|       com9|        3msk|
|      http://3msk.ru|             3msk.ru| top.vy3.ru|http://top.vy3.ru...|        vy3|        3msk|
+--------------------+--------------------+-----------+--------------------+-----------+------------+
only showing top 5 rows
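
As a starting point for graph analysis, one might collapse these page-level links into weighted domain-to-domain edges, a typical input for community detection. The query below is an illustrative sketch against the DataFrame above, not necessarily a step of the repository's actual pipeline:

# Illustrative sketch: aggregate page-level links into weighted
# domain-to-domain edges, dropping self-links within a domain.
from pyspark.sql import functions as F

edges = (df.groupBy("parentDomain", "childDomain")
           .agg(F.count("*").alias("weight"))
           .filter(F.col("parentDomain") != F.col("childDomain")))
edges.orderBy(F.desc("weight")).show(5)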

Interactive Analysis

To develop a workflow and make intuitive visualizations, use the Jupyter notebook graph_mining.ipynb to query the data interactively in PySpark. This requires the PySpark environment to be configured on your system:

export SPARK_HOME=/home/<user>/spark-2.3.1-bin-hadoop2.7/
export PYSPARK_PYTHON=python3

Open and run the graph_mining.ipynb notebook from a shell in which the above commands have been run, so that the notebook can find PySpark on your system.
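
If the notebook still cannot locate PySpark, a common workaround (an assumption here, not something this repository prescribes) is the findspark package, which reads SPARK_HOME from the environment and adds PySpark to the Python path:

# Hedged sketch: bootstrap PySpark inside the notebook. Assumes the
# findspark package is installed (pip3 install findspark).
import findspark
findspark.init()  # reads SPARK_HOME from the environment

from pyspark.sql import SparkSession

# A local SparkSession for interactive queries in graph_mining.ipynb;
# sc matches the SparkContext used in the SQLContext example above.
spark = SparkSession.builder.appName("graph_mining").getOrCreate()
sc = spark.sparkContext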
