
[GitHub Graph](http://githubgraph.com/)

Work in Progress

Index

  1. [Introduction](README.md#1-introduction)
  2. [AWS Clusters](README.md#2-aws-clusters)
  3. [Data Pipeline](README.md#3-data-pipeline)
  4. [API](README.md#4-api)
  5. [Front End](README.md#5-front-end)
  6. [Presentation](README.md#6-presentation)

1. Introduction

GitHub Graph is a tool that helps developers stay up to date with the GitHub trends they are interested in. It uses [GitHub Archive](https://www.githubarchive.org/) and the [GitHub API](https://developer.github.com/v3/users/) as its data sources. Essentially, it is a big data pipeline focused on answering: "For the users I follow, which repositories do those users follow and contribute to?"

![Example graph query](flask/static/img/graph.png)

[GitHub Graph](http://githubgraph.com/) also shows the top 10 weekly trending repositories based on the number of stars.

![Weekly trending repos](flask/static/img/trending.png)

Data Sources

  • [GitHub Archive](https://www.githubarchive.org/): [Ilya Grigorik](https://www.igvita.com/) started the [GitHub Archive](https://www.githubarchive.org/) project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. It has a very simple API that serves data at an hourly granularity. [GitHub Graph](http://githubgraph.com/) pulled 850+ GB of data from this source, spanning December 2011 to the present. The archives are clean, nested JSON responses with one event per line; events include 'WatchEvent', 'ForkEvent', 'CommitCommentEvent', etc.

![Public timeline](flask/static/img/publictimeline.png)

An example command to collect activity for all of January 2015 from [GitHub Archive](https://www.githubarchive.org/):

$ wget http://data.githubarchive.org/2015-01-{01..31}-{0..23}.json.gz

The [GitHub API](https://developer.github.com/v3/users/) has a rate limit of 5,000 calls per hour per token, so around 25 access tokens were used to collect the data. Thanks to my fellow Fellows at Insight and friends in India.
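
A minimal sketch of how requests can be spread across that many tokens, assuming simple round-robin rotation with a back-off when a token is exhausted (the token list, helper name, and rotation policy are illustrative assumptions, not the project's actual collection script):

```python
import itertools
import time
import requests

TOKENS = ["<token-1>", "<token-2>"]  # ... ~25 personal access tokens
token_cycle = itertools.cycle(TOKENS)

def github_get(url):
    """GET a GitHub API URL, rotating through tokens on each call."""
    while True:
        token = next(token_cycle)
        resp = requests.get(url, headers={"Authorization": "token " + token})
        # X-RateLimit-Remaining is a documented GitHub API response header.
        if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
            time.sleep(60)  # this token is exhausted; wait briefly and retry
            continue
        return resp.json()
```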

2. AWS Clusters

[GitHub Graph](http://githubgraph.com/) is powered by three clusters on AWS:

![Clusters](/flask/static/img/clusters.png)

3. Data Pipeline

![Pipeline](/flask/static/img/pipeline.png)

Camus is great for the Kafka->HDFS leg of the pipeline: it keeps track of the last offset consumed for each topic, and it lets you whitelist and blacklist topics so that only a subset of them is consumed. Moreover, it compresses the data before saving it to [HDFS](http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), which cuts storage by an order of magnitude. Camus is easy to set up and well worth the time spent; however, one important thing to note is that the Camus jar and log4j.xml must be on Hadoop's path for Camus to run.
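
A rough sketch of the relevant pieces of a camus.properties follows; the property keys come from Camus's example configuration, while the topic names and HDFS paths here are assumptions for illustration:

```properties
# Where Camus lands topic data and keeps its own offset bookkeeping in HDFS
etl.destination.path=/user/camus/topics
etl.execution.base.path=/user/camus/exec
etl.execution.history.path=/user/camus/exec/history

# Consume only the GitHub topics (hypothetical topic names)
kafka.whitelist.topics=github_events,github_users
kafka.blacklist.topics=

# Compress output before it lands in HDFS
etl.output.codec=gzip
mapred.output.compress=true
```

The job itself is then launched with the standard Camus entry point, e.g.:

$ hadoop jar camus-example-<version>-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties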

  • Batch Processing

Spark SQL is used here for all the batch processing. The data from [GitHub Archive](https://www.githubarchive.org/) is filtered for events like 'WatchEvent', 'ForkEvent', and 'CommitCommentEvent', since these indicate that a user has either contributed to a repository or is following one.

The data from GitHub Archive has an inconsistent schema. The batch process reduces it to a two-column schema: a user, and the list of all the repositories that he/she has contributed to or is following. Below is an example of the schema from 2015 and 2012 and the resulting two-column schema.

![Inconsistent schema](/flask/static/img/inconsistent%20schema.png)
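
A minimal PySpark sketch of this filtering step is below. The nested field names (actor.login, repo.name) match the newer event schema, while the input path and the Spark entry point are assumptions rather than the project's actual job (older archive years need their own field mappings, which is exactly the inconsistency noted above):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gitgraph-batch").getOrCreate()

# Each archive file is newline-delimited JSON, one event per line.
events = spark.read.json("hdfs:///user/camus/topics/github_events/*")

# Keep only the event types that signal interest in a repository.
interest = events.filter(
    F.col("type").isin("WatchEvent", "ForkEvent", "CommitCommentEvent"))

# Collapse to the two-column schema: user -> list of repositories.
user_repos = (interest
    .select(F.col("actor.login").alias("user"), F.col("repo.name").alias("repo"))
    .groupBy("user")
    .agg(F.collect_set("repo").alias("repos")))
```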

The two-column schema from the batch process is used to create a table in Cassandra:

![Userrepo table](/flask/static/img/userrepo.png)

GitHub's API returns many fields in its JSON responses for usernames and IDs that are not really important for the application. So, 12 GB of raw data for 12M+ users is reduced to 450 MB by extracting just the login names and IDs in the batch process, and this is further compressed to 122 MB as it is consumed by Camus.

![Users table](/flask/static/img/users.png)
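
A sketch of that extraction step, assuming the raw API responses are stored one JSON object per line (the file paths are hypothetical; 'id' and 'login' are fields the GitHub users API actually returns):

```python
import json

# Reduce raw user responses to just (id, login) pairs.
with open("raw_users.json") as raw, open("users_small.json", "w") as out:
    for line in raw:
        user = json.loads(line)
        out.write(json.dumps({"id": user["id"], "login": user["login"]}) + "\n")
```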

These usernames are then used to pull records of whom each user is following, which populates the Cassandra table below.

![Userfollowing table](/flask/static/img/userfollowing.png)
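
The per-user following list maps onto the documented GET /users/{username}/following endpoint. Reusing the hypothetical github_get helper from the token-rotation sketch above:

```python
def fetch_following(username):
    """Return the logins of everyone `username` follows (pagination omitted)."""
    following = github_get("https://api.github.com/users/%s/following" % username)
    return [u["login"] for u in following]

print(fetch_following("octocat"))
```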

Weekly trends are learned from 'WatchEvent's in GitHub Archive, and a table with repo names and their corresponding watch counts is created.
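
Continuing the PySpark sketch above, the aggregation might look like this (the date cutoff is a hypothetical window start; ISO 8601 timestamps compare correctly as strings):

```python
# Count watch events per repository over the past week.
weekly_trends = (events
    .filter(F.col("type") == "WatchEvent")
    .filter(F.col("created_at") >= "2015-06-01")  # hypothetical window start
    .groupBy(F.col("repo.name").alias("repo"))
    .count()
    .orderBy(F.col("count").desc()))
```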

  • Serving Layer

Cassandra is used to save the batch results and serve the front end. Three main tables, sketched after the list below, serve the application:

  • Userrepo: key is the username; value is the list of repos that the user follows or has contributed to.
  • Userfollow: key is the username; value is the list of usernames of the people the user follows.
  • Weeklytrends: key is the repo name; value is the watch count for the past week.
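
A sketch of what these tables might look like via the cassandra-driver Python client; the keyspace, column names, and types are assumptions based on the descriptions above:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("gitgraph")  # hypothetical keyspace

session.execute("""CREATE TABLE IF NOT EXISTS userrepo (
    username text PRIMARY KEY, repos list<text>)""")
session.execute("""CREATE TABLE IF NOT EXISTS userfollow (
    username text PRIMARY KEY, following list<text>)""")
session.execute("""CREATE TABLE IF NOT EXISTS weeklytrends (
    reponame text PRIMARY KEY, watch_count int)""")
```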

4. API

The API has the following endpoints:

  • Trending
    http://52.8.127.252:5000/api/trending
  • Graph
    http://52.8.127.252:5000/api/graph/username/year
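
For example, a client can hit the trending endpoint like this (the exact response shape is not documented in this README, so treat it as an assumption):

```python
import requests

trending = requests.get("http://52.8.127.252:5000/api/trending").json()
print(trending)
```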

5. Front End

Flask is used for the web app, and D3 is used to visualize the results.
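
As a sketch of how a Flask route might serve the Weeklytrends table (the route matches the API section above; the Cassandra query and keyspace are assumptions):

```python
from flask import Flask, jsonify
from cassandra.cluster import Cluster

app = Flask(__name__)
session = Cluster(["127.0.0.1"]).connect("gitgraph")  # hypothetical keyspace

@app.route("/api/trending")
def trending():
    rows = session.execute("SELECT reponame, watch_count FROM weeklytrends")
    # Cassandra can't sort across partitions, so take the top 10 here.
    top10 = sorted(rows, key=lambda r: r.watch_count, reverse=True)[:10]
    return jsonify([{"repo": r.reponame, "watches": r.watch_count} for r in top10])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```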

6. Presentation

The presentation for [GitHub Graph](http://githubgraph.com/) can be found here.

I encourage you to play with GitHub Graph and explore the trends you are interested in.
