Open source development has produced millions of valuable products and tools used by engineers around the world. The success of these projects, however, depends on a large and active community of developers willing to donate their time.
To get a sense of the overall health of this community, consider the period between January and May 2016. During this time, 3.7 million users had at least one open source event on GitHub. The problem: over the following 5 months, 48% of those users made zero contributions.
What is needed is a way to monitor and maintain the health of the community. Code Churn is a tool for identifying users who are likely to churn so that stakeholders, such as project leaders, can intervene before it happens.
- Google Cloud API
- PySpark
- Pandas
- Flask
- EC2
SELECT
  COUNT(type) AS event_count,
  EXTRACT(MONTH FROM created_at) AS created_at
FROM `githubarchive.year.2018`
WHERE public
GROUP BY created_at
ORDER BY created_at;
SELECT
  COUNT(DISTINCT(actor.login)) AS number_active_users,
  EXTRACT(MONTH FROM created_at) AS created_at
FROM `githubarchive.year.2015`
WHERE public
GROUP BY created_at
ORDER BY created_at;
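Once either monthly aggregate has been exported to csv (the export steps are described below), it can be sanity-checked with Pandas. A minimal sketch on toy data (the values below are illustrative, not real GitHub counts):

```python
import io

import pandas as pd

# Toy stand-in for the monthly event-count csv exported from the query above.
csv_text = """event_count,created_at
1200,1
1350,2
1100,3
"""

df = pd.read_csv(io.StringIO(csv_text))
# Month-over-month change in event volume as a rough community-health signal.
df["mom_change"] = df["event_count"].pct_change()
print(df)
```

A sustained negative month-over-month change is exactly the kind of trend the tool is meant to surface early.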
GitHub event data is archived nightly here. The data are stored in Google BigQuery. The following query, run in BigQuery, returns per-user counts of every event type during the specified period.
/* also get a count of the different types of events */
SELECT actor.login as actor,
COUNT(actor) as event_count,
MAX(created_at) as last_event,
MIN(created_at) as first_event,
SUM(CASE WHEN type = 'CheckRunEvent' then 1 else 0 end) as CheckRunEvent_count,
SUM(CASE WHEN type = 'CheckSuiteEvent' then 1 else 0 end) as CheckSuiteEvent_count,
SUM(CASE WHEN type = 'CommitCommentEvent' then 1 else 0 end) as CommitCommentEvent_count,
SUM(CASE WHEN type = 'ContentReferenceEvent' then 1 else 0 end) as ContentReferenceEvent_count,
SUM(CASE WHEN type = 'CreateEvent' then 1 else 0 end) as CreateEvent_count,
SUM(CASE WHEN type = 'DeleteEvent' then 1 else 0 end) as DeleteEvent_count,
SUM(CASE WHEN type = 'DeploymentEvent' then 1 else 0 end) as DeploymentEvent_count,
SUM(CASE WHEN type = 'DeploymentStatusEvent' then 1 else 0 end) as DeploymentStatusEvent_count,
SUM(CASE WHEN type = 'DownloadEvent' then 1 else 0 end) as DownloadEvent_count,
SUM(CASE WHEN type = 'FollowEvent' then 1 else 0 end) as FollowEvent_count,
SUM(CASE WHEN type = 'ForkEvent' then 1 else 0 end) as ForkEvent_count,
SUM(CASE WHEN type = 'ForkApplyEvent' then 1 else 0 end) as ForkApplyEvent_count,
SUM(CASE WHEN type = 'GitHubAppAuthorizationEvent' then 1 else 0 end) as GitHubAppAuthorizationEvent_count,
SUM(CASE WHEN type = 'GistEvent' then 1 else 0 end) as GistEvent_count,
SUM(CASE WHEN type = 'GollumEvent' then 1 else 0 end) as GollumEvent_count,
SUM(CASE WHEN type = 'InstallationEvent' then 1 else 0 end) as InstallationEvent_count,
SUM(CASE WHEN type = 'InstallationRepositoriesEvent' then 1 else 0 end) as InstallationRepositoriesEvent_count,
SUM(CASE WHEN type = 'IssueCommentEvent' then 1 else 0 end) as IssueCommentEvent_count,
SUM(CASE WHEN type = 'IssuesEvent' then 1 else 0 end) as IssuesEvent_count,
SUM(CASE WHEN type = 'LabelEvent' then 1 else 0 end) as LabelEvent_count,
SUM(CASE WHEN type = 'MarketplacePurchaseEvent' then 1 else 0 end) as MarketplacePurchaseEvent_count,
SUM(CASE WHEN type = 'MemberEvent' then 1 else 0 end) as MemberEvent_count,
SUM(CASE WHEN type = 'MembershipEvent' then 1 else 0 end) as MembershipEvent_count,
SUM(CASE WHEN type = 'MilestoneEvent' then 1 else 0 end) as MilestoneEvent_count,
SUM(CASE WHEN type = 'OrganizationEvent' then 1 else 0 end) as OrganizationEvent_count,
SUM(CASE WHEN type = 'OrgBlockEvent' then 1 else 0 end) as OrgBlockEvent_count,
SUM(CASE WHEN type = 'PageBuildEvent' then 1 else 0 end) as PageBuildEvent_count,
SUM(CASE WHEN type = 'ProjectCardEvent' then 1 else 0 end) as ProjectCardEvent_count,
SUM(CASE WHEN type = 'ProjectColumnEvent' then 1 else 0 end) as ProjectColumnEvent_count,
SUM(CASE WHEN type = 'ProjectEvent' then 1 else 0 end) as ProjectEvent_count,
SUM(CASE WHEN type = 'PublicEvent' then 1 else 0 end) as PublicEvent_count,
SUM(CASE WHEN type = 'PullRequestEvent' then 1 else 0 end) as PullRequestEvent_count,
SUM(CASE WHEN type = 'PullRequestReviewEvent' then 1 else 0 end) as PullRequestReviewEvent_count,
SUM(CASE WHEN type = 'PullRequestReviewCommentEvent' then 1 else 0 end) as PullRequestReviewCommentEvent_count,
SUM(CASE WHEN type = 'PushEvent' then 1 else 0 end) as PushEvent_count,
SUM(CASE WHEN type = 'ReleaseEvent' then 1 else 0 end) as ReleaseEvent_count,
SUM(CASE WHEN type = 'RepositoryEvent' then 1 else 0 end) as RepositoryEvent_count,
SUM(CASE WHEN type = 'RepositoryImportEvent' then 1 else 0 end) as RepositoryImportEvent_count,
SUM(CASE WHEN type = 'RepositoryVulnerabilityAlertEvent' then 1 else 0 end) as RepositoryVulnerabilityAlertEvent_count,
SUM(CASE WHEN type = 'SecurityAdvisoryEvent' then 1 else 0 end) as SecurityAdvisoryEvent_count,
SUM(CASE WHEN type = 'StatusEvent' then 1 else 0 end) as StatusEvent_count,
SUM(CASE WHEN type = 'TeamEvent' then 1 else 0 end) as TeamEvent_count,
SUM(CASE WHEN type = 'TeamAddEvent' then 1 else 0 end) as TeamAddEvent_count,
SUM(CASE WHEN type = 'WatchEvent' then 1 else 0 end) as WatchEvent_count
FROM (
SELECT public, type, repo.name, actor.login, created_at,
JSON_EXTRACT(payload, '$.action') as event
FROM (TABLE_DATE_RANGE([githubarchive:day.] ,
TIMESTAMP('2018-01-01'),
TIMESTAMP('2018-06-01')
))
WHERE public  # alternatively: type = 'IssuesEvent' AND public
)
GROUP by actor;
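The per-user table this query produces can be screened for likely churners directly. A minimal Pandas sketch using a simple inactivity proxy (the 90-day threshold and the toy rows are illustrative, not values from the project):

```python
import io

import pandas as pd

# Toy rows standing in for the per-user export of the query above.
csv_text = """actor,event_count,first_event,last_event
alice,42,2018-01-03,2018-05-28
bob,3,2018-01-10,2018-02-01
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["first_event", "last_event"])

# Flag users whose last event falls long before the end of the
# observation window (2018-06-01, matching the query's date range).
window_end = pd.Timestamp("2018-06-01")
df["days_inactive"] = (window_end - df["last_event"]).dt.days
df["at_risk"] = df["days_inactive"] > 90
```

This is only a screening heuristic; the actual churn label (zero contributions in the five months after the window, per the introduction) requires event data from after the window.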
After the query finishes, the resulting table must be exported to csv in Google Cloud Storage. To do that:

- In the BigQuery console, select the "Job information" tab. Scroll down and select "Temporary Table". On the right side, select the "Export" tab and choose "Export to GCS". Choose the save location, file type, and compression. For tables larger than about 500 MB, the file name should end in *.csv; the * wildcard instructs Google Cloud to break the table up over multiple files.
- Download each csv file from GCS.
- Collect the files into a data directory.
- Install csvkit for manipulating multiple csv files:

sudo pip install csvkit

- Merge the files:

csvstack events_2018_01_01_2018_06_01* > events_02_02_2018_06_01.csv
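If csvkit is not available, the same stacking can be done with Pandas. A sketch (the shard file names below are toy stand-ins for the GCS export pieces, not the project's real files):

```python
import glob
import pathlib

import pandas as pd

# Write two toy shard files standing in for the GCS export pieces.
pathlib.Path("events_part000.csv").write_text("actor,event_count\nalice,5\n")
pathlib.Path("events_part001.csv").write_text("actor,event_count\nbob,7\n")

# Row-wise concat works because every shard shares the same header.
paths = sorted(glob.glob("events_part*.csv"))
merged = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
merged.to_csv("events_merged.csv", index=False)
```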
Data_fetch.ipynb
Merge csv files using csvkit.
csvstack user_batch_* > all_users.csv
Add a .csv extension where it is missing (passing .csv as the suffix argument to basename keeps the loop idempotent):

for old in *; do mv "$old" "$(basename "$old" .csv).csv"; done
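The rename step can also be done from Python; a minimal sketch, demonstrated on a throwaway directory so nothing real is touched:

```python
import pathlib
import tempfile

# Demonstrate on a throwaway directory rather than the real data directory.
workdir = pathlib.Path(tempfile.mkdtemp())
(workdir / "user_batch_0").write_text("actor,event_count\n")
(workdir / "already.csv").write_text("actor,event_count\n")

# Give a .csv suffix to every file that does not have one yet.
for path in workdir.iterdir():
    if path.is_file() and path.suffix != ".csv":
        path.rename(path.with_name(path.name + ".csv"))

print(sorted(p.name for p in workdir.iterdir()))  # → ['already.csv', 'user_batch_0.csv']
```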
Visualize the exported decision trees with eurekatrees:

pip install eurekatrees
eurekatrees --trees ./trees/rf_tree.txt
Install the Google Cloud SDK (steps adapted from StackOverflow):

curl https://sdk.cloud.google.com | bash

- Restart the shell:

exec -l $SHELL

- Copy the pem file:

scp -i <keypair> myfile.txt ubuntu@ec2-x-x-x.com:

The colon at the end ensures the file is copied to the home directory. To specify another location, add the path directly after the colon, e.g. ubuntu@ec2-x-x-x.com:/path/to/dir/.

- Link the account:

gcloud auth activate-service-account --key-file=<path to key file>
gcloud init

(make the selections appropriate to your setup).

- Install Python 2.7 if it is not already present:

sudo apt install python2.7

- After the steps above, run gsutil ls or bq ls and check that buckets and datasets are listed properly.
Install the Google Cloud Storage Python client:

pip install google-cloud-storage
Install Anaconda and recreate the environment from a spec file:

- wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
- bash Anaconda3-5.0.1-Linux-x86_64.sh
- conda list --explicit > spec-file.txt
- scp the spec-file.txt to the target machine
- conda create --name myenv --file spec-file.txt