
AlanAldaVista

PageRank calculation using Amazon Elastic MapReduce (for Caltech Winter 2015 CS144)

Future students of CS144 should NOT view the contents of this repository or read below this point. Doing so is a direct violation of the honor code.

Algorithm Pseudocode (using Amazon EMR)

[Figure: Alan Alda pseudocoding]

[Figure: PageRank Map pseudocode]

[Figure: PageRank Reduce pseudocode]

[Figure: Process Reduce pseudocode]
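
The actual pseudocode is in the figures above. As a rough illustration only, one PageRank iteration expressed as map and reduce steps might look something like the plain-Python sketch below; the function names, the record format (node id, rank, out-neighbor list), and the ALPHA constant are assumptions for illustration, not the project's code.

```python
# Illustrative sketch of one PageRank iteration as map/reduce steps.
# Assumed record format: (node_id, (rank, [out_neighbors])).

ALPHA = 0.85  # damping factor specified in the project guideline


def pagerank_map(node_id, rank, out_neighbors):
    """Emit the node's adjacency list plus a rank share for each out-neighbor."""
    # Pass the graph structure through so the reducer can rebuild the record.
    yield node_id, ('graph', out_neighbors)
    if out_neighbors:
        share = rank / len(out_neighbors)
        for neighbor in out_neighbors:
            yield neighbor, ('rank', share)


def pagerank_reduce(node_id, values, num_nodes):
    """Sum incoming rank shares and apply the damping factor."""
    out_neighbors = []
    incoming = 0.0
    for kind, value in values:
        if kind == 'graph':
            out_neighbors = value
        else:
            incoming += value
    new_rank = (1 - ALPHA) / num_nodes + ALPHA * incoming
    yield node_id, (new_rank, out_neighbors)
```

A final Process Reduce pass (the last figure) would presumably sort the converged ranks to report the top twenty nodes.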

Approach

We adopted an Occam's Razor approach to the problem: more complicated solutions might ultimately prove worthwhile, but the fewer assumptions the better.

Because the data sets to be processed were not known in advance, we could make no assumptions about their structure, and without such knowledge it was difficult to optimize for any specific characteristics of the network graph.

Additionally, the PageRank computation and the map and reduce functions scale linearly in the number of operations, a valuable characteristic that we sought to preserve in the algorithm. For this reason, complicated optimizations were omitted in favor of Pythonic improvements to the code, which give constant-factor operational benefits rather than algorithmic ones.
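
As a hypothetical example of the kind of constant-factor, Pythonic improvement meant here (not necessarily one used in this project), rank contributions can be streamed from a generator with the division hoisted out of the loop, rather than accumulated in an intermediate list:

```python
def contributions_eager(rank, out_neighbors):
    # Straightforward version: builds a full list and divides on every pass.
    result = []
    for neighbor in out_neighbors:
        result.append((neighbor, rank / len(out_neighbors)))
    return result


def contributions_lazy(rank, out_neighbors):
    # "Pythonic" version: one division, and pairs are yielded lazily.
    share = rank / len(out_neighbors)
    return ((neighbor, share) for neighbor in out_neighbors)
```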

Tolerance checks to determine the number of iterations were also omitted; instead, the iteration count was fixed in advance using the convergence property implied by the desired error (discussed below). Skipping the tolerance check spared the program costly sorting and comparison work between iterations, which contributed to an optimized run time.
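
In driver terms, this simply means applying the map/reduce pass a fixed number of times rather than comparing successive rank vectors. A minimal sketch, assuming the pass is available as a `step` callable (the name and the iteration count here are illustrative):

```python
NUM_ITERATIONS = 28  # fixed in advance; see the error analysis below


def run_fixed_iterations(ranks, step, iterations=NUM_ITERATIONS):
    """Apply one map/reduce pass a fixed number of times, with no tolerance check."""
    for _ in range(iterations):
        ranks = step(ranks)
    return ranks
```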

Technical Considerations

One of the important questions under consideration was: how close to the final PageRanks does the algorithm have to get before the top twenty nodes are correctly ordered? In other words, how much error from the true PageRanks is tolerable while still obtaining an accurate ranking?

To answer this question, the team turned to the heavy-tailed property of network graphs. Under this framework, the PageRanks of the nodes follow a heavy-tailed distribution that obeys the "catastrophe principle": the total PageRank of the graph is concentrated far more heavily in the highest-ranked nodes than in the lower-ranked ones.

Since the PageRanks of the top twenty nodes would be on the order of 1 or larger, the team concluded that an absolute error of 0.01 (relative to the final PageRanks) would be sufficient to rank the top twenty nodes correctly, as it amounts to a negligible relative error for those nodes.

Since the number of iterations executed before declaring convergence partly determines the run time of the program, the natural follow-up question is, "how many iterations are required to reach an error of 0.01?" Because the error shrinks by roughly a factor of alpha with each iteration, requiring alpha^k <= error gives the answer: convergence iterations = log(error) / log(alpha).

In the project guideline, alpha is specified as 0.85, so the convergence iteration count is about log(0.01) / log(0.85) ≈ 28.
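
A quick check of that arithmetic:

```python
import math

alpha = 0.85  # damping factor from the project guideline
error = 0.01  # target absolute error in the final PageRanks

iterations = math.log(error) / math.log(alpha)
print(round(iterations, 1))  # 28.3, so roughly 28-29 iterations suffice
```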

Contributions

[Figure: Alan Alda contributing]

Sean Dolan

  • Implemented and debugged the algorithm in Python
  • Provided insight on the convergence iteration approximation
  • Implemented optimizations proposed in brainstorming sessions

Jan Van Bruggen

  • Responsible for testing the program
  • Implemented several testing scripts and procedures in Python
  • Tested procedures and scripts both locally and remotely on AWS
  • Participated in writing and debugging code

Brad Chattergoon

  • Assisted in developing the algorithm
  • Participated in brainstorming sessions
  • Researched potential optimization implementations for the algorithm
  • Took lead on constructing the project report
