Mortar Github Recommender

This Mortar project contains the scripts which generate recommendations for gitrec.mortardata.com. This README details how the algorithm works; for just a quick overview, see this blog post.

The recommendations are generated by three pigscripts: parse_events.pig, gen_item_recs.pig, and gen_user_recs.pig, all of which can be called from the controlscript gitrec.py.

To be more generic (and possibly reusable), the code uses the term "item" instead of "repo". Also, if something that looks like a function call appears at the beginning of a Pig statement, e.g.

user_ids = Utils__AssignIntegerIds(...)

that is actually a macro defined in the file named by the portion before the double underscore, in this case utils.pig.

If you want to run these scripts yourself, you will have to copy Github event logs from the Github Archive into your own S3 bucket, and optionally download a sample to use for local development.

  • This repo has a script for copying the logs to your S3 bucket; be sure to run it on an EC2 node for high bandwidth to S3, and run it in a screen session so your SSH connection doesn't time out.
  • If you want to use a sample of the logs for local development, download the sample you want from your S3 bucket and change the EVENT_LOGS_PATH parameter in parse_events.pig to point at your data (the path is relative to the pigscripts directory). Five days of logs should be enough to see sensible results for active repos, but not so much that your laptop cannot handle it.

To run one of the scripts in local mode, use:

mortar local:run pigscripts/the_script.pig

To run it on Mortar, edit the paramfile and use:

mortar register my_gitrec
mortar run pigscripts/the_script.pig -f paramfiles/gitrec-cloud.params -s 11

To run all the scripts together, use controlscripts/gitrec.py in the commands above instead of pigscripts/the_script.pig.

Generating A User-Repo Affinity Graph

parse_events.pig parses raw Github event logs into a "user-repo affinity" graph where each edge represents that a user U has some connection to a repo R.

We consider four kinds of signals:

  • user watches repo
  • user forks repo
  • user submits a pull request to a repo
  • user pushes a commit to a repo

Each of these signals is assigned three values: one for "specific interest" (in our case, pushes and pull requests express this), one for "general interest" (watches and forks express this), and a "graph score" (a combination of both that is used to build the actual graph).
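
For illustration, the per-signal values might be laid out like this; the numbers below are made-up assumptions, and the real weights live in parse_events.pig:

# Hypothetical per-signal weights; the actual values are defined in parse_events.pig.
# Each signal carries a "specific interest" value, a "general interest" value,
# and a "graph score" used when building the affinity graph.
SIGNAL_WEIGHTS = {
    # signal:       (specific, general, graph)
    "push":         (1.0, 0.0, 1.0),
    "pull_request": (1.0, 0.0, 1.0),
    "fork":         (0.0, 1.0, 0.75),
    "watch":        (0.0, 1.0, 0.5),
}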

We aggregate each of these scores for each unique user-repo pair. To prevent a hyperactive user who pushes many commits to a repo every day from skewing the algorithm, we use a logistic scaling function to map each aggregate score to a value in [0, 1] representing how engaged the user is with the repo.
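
A minimal sketch of this kind of logistic scaling (the exact function and steepness constant used by parse_events.pig may differ; k below is an assumed parameter):

import math

def logistic_scale(aggregate_score, k=0.5):
    """Map an unbounded non-negative aggregate score into [0, 1).

    A score of 0 maps to 0 and ever-larger scores approach 1, so a user
    who pushes hundreds of commits cannot weigh arbitrarily more than a
    user who pushes a handful.
    """
    return 2.0 / (1.0 + math.exp(-k * aggregate_score)) - 1.0

# logistic_scale(1) ~= 0.24, logistic_scale(10) ~= 0.99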

parse_events.pig also extracts the most recent metadata we have available for each repo. Combining the fork and star counts from that metadata (popularity) with the sum of all user-repo affinity scores for the repo (activity) gives an overall "importance score" for each repo.
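
As a sketch, popularity and activity might be combined roughly like this; the log-damping and the equal weighting are assumptions, and parse_events.pig defines the real formula:

import math

def importance_score(num_forks, num_stars, total_affinity):
    # Log-damp the raw counts so mega-popular repos don't drown out
    # everything else (whether the real script does this is an assumption).
    popularity = math.log(1 + num_forks + num_stars)
    activity = math.log(1 + total_affinity)
    return popularity + activity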

Finding The Neighborhood Of Each Repo

gen_item_recs.pig finds the collection of repos that are most similar to each individual repo, which we call its "neighborhood". It outputs these neighborhoods for use by the subsequent script, but it also joins a copy of them to repo metadata to produce final recommendations for "repos similar to the given repo" (we use DynamoDB to store the recommendations, so the output has to be denormalized).
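
Because DynamoDB cannot join at read time, each output row carries the metadata it needs alongside the recommendation itself. A hypothetical shape for one such record (field names and values here are illustrative, not the project's actual schema):

# One "repos similar to X" record, denormalized so no join is needed at read time.
item_rec = {
    "item": "twitter/bootstrap",   # the repo being viewed
    "rank": 1,                     # position within the neighborhood
    "rec": "zurb/foundation",      # the recommended similar repo
    "rec_description": "...",      # metadata copied into the row
    "rec_language": "...",         # rather than joined at read time
}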

The first step in generating the neighborhoods is to reduce the user-repo affinities into a repo-repo similarity graph. Each user contributes to a link between repos A and B if they have interacted with both A and B (e.g. forked both). We sum up every user's contribution to the A-B link to get a "link score". We then use the "link score" for A-B and the "importance score" for B to estimate P(user interacts with A | user interacts with B), incorporating a Bayesian prior to guard against small sample sizes. The reciprocal of this probability is taken as the distance A->B. Note that these distances are not symmetric, that is, A->B =/= B->A. Under this distance metric, "close" repos are similar and "far" repos are dissimilar.

The reason for choosing to estimate P(A | B) for the A->B edge instead of P(B | A) is a bit counter-intuitive, but this example should help explain: suppose P(twitter/bootstrap | random/repo) = 1, that is, every user who likes random/repo also likes twitter/bootstrap. That means random/repo -> twitter/bootstrap should be a short distance; but twitter/bootstrap -> random/repo should definitely not be short, since the users the two repos share are only a tiny percentage of all the users of twitter/bootstrap.
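
A minimal sketch of the distance estimate, assuming a simple additive prior and using B's importance score as the effective sample size; the prior's form and strength in gen_item_recs.pig are the project's own:

def distance(link_score_ab, importance_b, prior_strength=10.0, prior_mean=0.01):
    """Estimate P(user interacts with A | user interacts with B), then
    return its reciprocal as the A->B distance."""
    # Pull the raw ratio toward prior_mean when B has little data, so a
    # single shared user cannot create a spuriously short edge.
    p_a_given_b = (link_score_ab + prior_strength * prior_mean) / \
                  (importance_b + prior_strength)
    return 1.0 / p_a_given_b

# The asymmetry falls out naturally: the A->B distance uses B's importance
# in the denominator, while B->A uses A's, so the two generally differ.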

To get distances for repos that are not directly connected, we follow shortest paths on the repo-repo distance graph up to a maximum path length of 3.
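
Conceptually this is a bounded shortest-path expansion. A small in-memory Python sketch of the idea (the actual implementation is a Pig dataflow over the distance graph, not Python):

def shortest_paths(edges, max_hops=3):
    """edges: {src: {dst: dist}} with positive distances. Returns, for each
    src, the shortest distance to every repo reachable in <= max_hops edges."""
    best = {src: dict(neighbors) for src, neighbors in edges.items()}
    for _ in range(max_hops - 1):  # each pass extends paths by one hop
        for src, reachable in best.items():
            # snapshot so this pass only extends previously found paths
            for mid, d1 in list(reachable.items()):
                for dst, d2 in edges.get(mid, {}).items():
                    if dst != src and d1 + d2 < reachable.get(dst, float("inf")):
                        reachable[dst] = d1 + d2
    return best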

Choosing Recommendations For Each User

gen_user_recs.pig does the last step of the recommender pipeline, choosing which repos to recommend to each user in each of two categories: "specific interest" (similar to repos you've contributed to) and "general interest" (similar to repos you've looked at). Once it generates its recommendations, it joins them to repo metadata to produce the final denormalized output.

Choosing recommendations is simple: consider all the repos in the neighborhoods of repos the user has interacted with in each category (specific/general), then assign a new user-neighbor affinity score (see the sketch after this list) based on:

  1. the original user-repo affinity score for the repo that led us to the neighbor (called the "reason")
  2. the distance between the reason repo and the neighbor repo
  3. the importance score of the neighbor repo

Then take the top N neighbor repos by affinity for each user.
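
One plausible way to combine the three factors into a single score; the actual combination in gen_user_recs.pig may differ, so treat the scoring line as an assumption:

import heapq

def top_n_recs(user_affinities, neighborhoods, importance, n=10):
    """user_affinities: {reason_repo: affinity} for one user and category
    neighborhoods:     {reason_repo: {neighbor_repo: distance}}
    importance:        {repo: importance score}
    Returns up to n (score, neighbor_repo, reason_repo) triples, best first."""
    best = {}
    for reason, affinity in user_affinities.items():
        for neighbor, dist in neighborhoods.get(reason, {}).items():
            # Assumed scoring: damp the original affinity by distance and
            # boost it by the neighbor's importance.
            score = affinity * importance.get(neighbor, 0.0) / dist
            if neighbor not in best or score > best[neighbor][0]:
                best[neighbor] = (score, reason)
    return heapq.nlargest(n, ((s, nb, r) for nb, (s, r) in best.items()))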

Some care is also taken to avoid the following bad recommendations:

  1. repos the user owns
  2. repos the user has already interacted with
  3. for the "general interest" category, repos which were already recommended as part of the "specific interest" category
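
In Python terms the filtering amounts to set subtraction; a sketch that pairs with the scoring function above:

def filter_recs(scored_recs, owned, interacted, already_recommended=()):
    """scored_recs: iterable of (score, neighbor_repo, reason_repo) triples.
    Drops repos the user owns or has interacted with, and, for the "general
    interest" pass, repos already recommended under "specific interest"."""
    excluded = set(owned) | set(interacted) | set(already_recommended)
    return [rec for rec in scored_recs if rec[1] not in excluded]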
