Skip to content

dSar-UVA/repoMiner

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GitCProc

GitCProc is a tool to extract changes to elements in code and associate them with their surrounding functions. It takes git diff logs as input and produces a table mapping the additions and deletions of the elements to a 4-tuple (project, sha, file, function/method) specifying the commit and location in the code of changed elements. It also analyzes commit messages to estimate if the commits were bug fixing or not.

Currently, we have designed it to work with C/C++/Java/Python, but we have designed the framework to be extensible to other languages with the concept of scope.

#Video Demo

There is a video walkthrough of running the tool on a simple example at: https://youtu.be/shugzDjxj0w

#Required Libraries GitCProc runs on python 2.7 and requires the following libraries: -psycopg2 -nltk -PyYAML -GitPython

The python script src/logChunk/installDependencies.py will install these for you. Be aware that psycopg2 requires postgres to be installed on your machine with necessary supporting libraries. For Ubuntu, make sure you have libpq-dev installed.

If you wish to output your results to a database instead of a csv file, you will need a postgres server installed.

#Running Instructions

The master script is located at src/logChunk/gitcproc.py, which depends on 3 input files to run.

  1. First, if you want to download your projects from GitHub, you need a file to specify the list of projects from GitHub. The file format is a list of full project names, e.g. caseycas/gitcproc, one on each line.

  2. The keywords file specifies what structures you wish to track. It is formatted in the following manner:


[keyword_1], [INCLUDED/EXCLUDED], [SINGLE/BLOCK]
...
[keyword_N],[INCLUDED/EXCLUDED], [SINGLE/BLOCK]

The first part is the keyword you wish to track. For instance, if you are tracking assert statements, one keyword of interest would be "assert", if you were tracking try-catch blocks, you might have "try" and "catch". The second part specifies whether the keyword is being searched for (INCLUDED) or specifically ignored (EXCLUDED). For example, if you are tracking assertions, but you wish to exclude headers, you would have a line like "assert.h,EXCLUDED,SINGLE". The last part specifies whether the keyword refers to a single line coding element or a multiple line element. A single line element might be "println", where as "for" with a BLOCK designation will track all the contents of the for loop encased by {...}.

If you want to exactly match a keyword, include "" around the keyword and it will match words where their is a break in alphanumeric characters. So "assert",INCLUDED,SINGLE would match assert(...), but not ASSERTTRUE(...). If you want less exact matching, do not include quotations.

  1. Finally, you need a config.ini file (several examples are included in src/util). The sections of the config.ini file are as follows:

[Database]

  • Here you list your database name, your user name, the host, port, schema, and the names for the two tables created by the tool. "table_method_detail" outputs information specific to changes to methods/functions, where "table_change_summary" specifies information about each commit processed.

[Repos]

  • Here you must specify your repo_url_file, which is described in 1) and the repo_locations:, which is where you want to store your downloaded projects.

[Keywords]

  • Here you must specify your keywords file, described in 2) under the option file:

[Log]

  • This section has 1 option, languages:, where you specify which file types to include in the log, currently, we support C, C++, Java, and Python file types.

[Flags]

  • Finally, this section contains several output and debugging options: SEP: -> A string used to flatten out downloaded project names. We recommend using ___ as we have not observed GitHub projects using this in their names. To illustrate what it does, if this tool downloads caseycas/gitcproc with SEP: "___" the directory name it will be saved to is caseycas___gitcproc. DEBUG: -> True or False, this outputs extensive debug information. It is highly recommended you set this to False when running on a real git project. DEBUGLITE: -> True or False, this outputs very lightweight debug information. DATABASE: -> True or False, this signals to write output to your Database. The tool will create the tables if they don't exist yet, but you must have postgres installed will an available database and schema specified in [Database]. This option will prompt you for your database password once the script gitcproc.py starts. CSV: -> True or False, this signals to write to a CSV file. This option is recommended if you want to miminize set up and test the tool out. LOGTIME: -> True or False, this signals to write performance time information. Currently it measure time to parse the log and time to write out to the Database or CSV file.

With these files created, to run the full process you can do something like: python gitcproc.py -d -wl -pl ../util/sample_conf.ini

The config file is mandantory, and option -d runs the project download step, -wl creates the git log for downloaded projects, and -pl parses the logs and sends the output to your choosen format.

#Running Tests This repository is set up to run travis CI on each commit at: https://travis-ci.org/caseycas/gitcproc

The test scripts are as follows src/logChunk/logChunkTestC.py src/logChunk/logChunkTestJAVA.py src/logChunk/logChunkTestPython src/logChunk/ghLogDbTest.py src/logChunk/ghLogDbTestPython.py src/logChunk/scopeTrackerTest.py

#Upcoming Features

  • Extensions to further languages
  • Better authorship recording (author + committer)
  • More output formatting features
  • Graphical/web interface
  • ...

About

Tool for analyzing git log messages and diffs.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%