- `code`: scripts and source code (e.g. `sample.py` used below)
- `analysis`: Jupyter notebook files for visualization
- `investigate`: main files to generate webgraphs and various results
- `data`: folder that can be used to store small data files for testing purposes. You could copy a few files from a single year to use as a small sample, or simply rename the provided `data_sample` directory to `data`.
- `lib`: library folder for locally installed frameworks; can include webgraph, apache-maven, aut, and spark
- `etc`: some handy configuration files
- `setup.sh`: script to run for the initial setup. Make sure to read through it first; it is definitely not 100% tested.
- Log into the server: `madmax4.stanford.edu`
- Clone this repository and `cd` into it
- Run `source setup.sh`. This will take a few minutes.
- Download the jar and its dependencies into the `lib/webgraph` directory
- When running Java files, specify the classpath, e.g. `java -cp "/dfs/scratch2/dankang/WebGraph/lib/webgraph/*" ...`
Open the file at `~/.bashrc.user`. It should include updates for: `WEBGRAPH_HOME`, `JAVA_HOME`, `SPARK_HOME`, `SPARK_LOCAL_IP`, `PATH`, `YARN_CONF_DIR`, `M2_HOME`, and `AUT_PATH`.
- `cd` into `WebGraph/lib`, and then into the spark directory
- Call: `bin/spark-shell --master local --packages "io.archivesunleashed:aut:0.17.0"`
- When the Spark shell comes up:
  - Type `:paste` to enter paste mode
  - Copy-paste the following code. Change the path so that it points to your `WebGraph/data/example.arc.gz` (likely only the USERID portion needs to change). Note that the path starts with `file://` to specify that we are using a local directory.
```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load the sample ARC file, keep valid pages, and count pages per domain (top 10)
val path = "file:///afs/cs.stanford.edu/u/USERID/WebGraph/data/example.arc.gz"
val r = RecordLoader.loadArchives(path, sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```
- Press `Ctrl + D` to execute. The result should look something like:

```
r: Array[(String, Int)] = Array((www.archive.org,132), (deadlists.com,2), (www.hideout.com.br,1))
```
From the `lib/aut` directory, launch PySpark:

```
pyspark --driver-class-path target/ --py-files target/aut.zip --jars target/aut-0.17.1-SNAPSHOT-fatjar.jar
```
Once it has loaded, run the following. Again, change the path as before.
```python
from aut import *

# Load the sample archive and count pages per domain
path = "file:///afs/cs.stanford.edu/u/USERID/WebGraph/data/example.arc.gz"
archive = WebArchive(sc, sqlContext, path)

pages = archive.pages()
pages.printSchema()
pages.select(extract_domain("Url").alias("Domain")) \
    .groupBy("Domain").count().orderBy("count", ascending=False).show()
```
This should return results similar to:

```
+--------------+-----+
|        Domain|count|
+--------------+-----+
|   archive.org|  132|
| deadlists.com|    2|
|hideout.com.br|    1|
+--------------+-----+
```
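If you want to keep these per-domain counts around rather than only printing them, the standard Spark DataFrame writer can save them from the same session. This is just a sketch using core PySpark; the output path below is an arbitrary example, not something defined by this repo.

```python
# Build the counts as a DataFrame instead of only calling .show()
counts = pages.select(extract_domain("Url").alias("Domain")) \
    .groupBy("Domain").count() \
    .orderBy("count", ascending=False)

# Write them out as CSV; the output path here is only an example
counts.write.csv("file:///tmp/domain-counts", header=True, mode="overwrite")
```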
Try running:

```
spark-submit ~/WebGraph/code/sample.py
```

It should print output similar to the PySpark example above.
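The contents of `code/sample.py` are not reproduced here; as a rough sketch, a standalone script equivalent to the interactive PySpark example might look like the following. The main difference from the shell is that `sc` and `sqlContext` are not predefined and must be created explicitly. The app name and overall structure here are assumptions, not necessarily what the repo's script actually does.

```python
# Hypothetical sketch of a standalone PySpark script along the lines of
# code/sample.py; the actual file in the repo may differ.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from aut import *

if __name__ == "__main__":
    # spark-submit does not predefine sc/sqlContext, so create them here
    sc = SparkContext(appName="WebGraphSample")
    sqlContext = SQLContext(sc)

    # Change USERID so the path points at your own checkout
    path = "file:///afs/cs.stanford.edu/u/USERID/WebGraph/data/example.arc.gz"
    archive = WebArchive(sc, sqlContext, path)

    pages = archive.pages()
    pages.select(extract_domain("Url").alias("Domain")) \
        .groupBy("Domain").count() \
        .orderBy("count", ascending=False).show()

    sc.stop()
```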