EDiscovery refers to the management of electronically stored information in litigation, dispute resolution proceedings, and investigations. Machine learning techniques such as supervised classification and unsupervised clustering have been employed to reduce manual (linear human) review and to increase investigative speed and efficiency. We propose to improve on the state of the art of machine learning for EDiscovery by i) using topic modeling to provide greater power than commonly employed methods such as keyword search and Latent Semantic Analysis, ii) using the identified topics to categorize documents and rank their relevance to a given query, and iii) using the topic framework to provide document summaries. Furthermore, to ensure broad adoption of this work, all software tools resulting from this effort will be implemented in the context of an open-source system that can serve as the basis for an open EDiscovery framework.
This section provides general guidelines for accessing the GitHub repository, coding style, enhancements, and issue tracking.
To Edit this file
See GitHub markdown online help
Enhancements and issues
- Use the issues tab to keep track of all issues and enhancements.
- When checking in source code to the repository, specify the issue or enhancement number in the commit message.
e.g.
git commit -a -m'issue #1 fix: see the issue details for information.'
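The convention above can also be checked mechanically. The following is a minimal sketch, not part of the repository: the pattern and the function name `referenced_numbers` are assumptions, chosen to match the `issue #1` wording in the example commit message.

```python
import re

# Hypothetical pattern: "issue #<number>" or "enhancement #<number>", case-insensitive.
ISSUE_REFERENCE = re.compile(r"\b(?:issue|enhancement)\s+#(\d+)", re.IGNORECASE)


def referenced_numbers(message):
    """Return the issue/enhancement numbers mentioned in a commit message."""
    return [int(number) for number in ISSUE_REFERENCE.findall(message)]
```

An empty list means the message does not reference any issue or enhancement, so a reviewer (or a local hook, if one is set up) could ask for the message to be amended.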
Git
To clone the ediscovery repository use the following command
git clone https://github.com/clintpgeorge/ediscovery
See crash course on Git SVN for more details. The following are some useful git commands
git pull # to update the local from the remote
git status # to see the local repository status
git add file_name # to add a new file file_name
git commit -a -m'[commit message]' # to commit all modified tracked files in your local repository
git push # to push your local commits to the remote repository
Python
- Do not check in *.pyc files
- Follow coding standards
- Use argparse for handling arguments
- Avoid hard-coded values in functions (test scripts excepted); pass constants as function parameters instead
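As an illustration of the last two guidelines, here is a minimal sketch of a script that reads its settings through argparse and passes them to a function as parameters. The script itself, the function names `top_terms`, `build_parser`, and `main`, and the `--limit` option are all hypothetical and do not come from the repository.

```python
import argparse


def top_terms(term_counts, limit):
    """Return the `limit` most frequent terms; ties are broken alphabetically."""
    ranked = sorted(term_counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]


def build_parser():
    """Describe the command-line interface in one place."""
    parser = argparse.ArgumentParser(
        description="Print the most frequent terms in a text file.")
    parser.add_argument("input_file", help="path to a plain-text file")
    parser.add_argument("--limit", type=int, default=10,
                        help="number of terms to print")
    return parser


def main(argv=None):
    """Parse arguments (sys.argv when argv is None) and print the top terms."""
    args = build_parser().parse_args(argv)
    counts = {}
    with open(args.input_file) as handle:
        for token in handle.read().lower().split():
            counts[token] = counts.get(token, 0) + 1
    for term in top_terms(counts, args.limit):
        print(term)
```

A real script would end with a call to `main()` under the usual `if __name__ == "__main__":` guard. Because `limit` is a parameter rather than a constant buried in `top_terms`, a test script can call the function directly with whatever value it needs.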
pyLucene Installation
Ubuntu
- Install g++
- Install python-dev
- Download pylucene3.6 from http://mirror.sdunix.com/apache/lucene/pylucene/
- Execute the following commands in the JCC installation directory
  - python setup.py build
  - sudo python setup.py install
- Uncomment the properties for your platform (Linux, Mac, Windows, etc.) in the pylucene Makefile
- Install ant
- Install setuptools
- Execute the following commands from the pylucene directory
  - make
  - sudo make install
  - make test
Windows 7 and 8
- Install Java JDK (32 bit) latest version
- Install PythonXY
- Add JRE & JDK paths to the Path Environment variable, e.g., G:\Program Files (x86)\Java\jre7\bin;G:\Program Files (x86)\Java\jre7\bin\client;G:\Program Files (x86)\Java\jdk1.7.0_51\bin;
- Install msvcr71.dll in C:\Windows\System32 and C:\Windows\SysWOW64
- Install pyLucene Extras using easy_install.
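After either installation route, a quick import check can confirm that PyLucene and its embedded JVM load correctly. This is a minimal sketch: the helper name `pylucene_version` is an assumption, while `lucene.initVM()` and `lucene.VERSION` are the calls PyLucene provides for starting the JVM and reporting the Lucene version.

```python
def pylucene_version():
    """Return the PyLucene version string, or None if the package cannot be imported."""
    try:
        import lucene
    except ImportError:
        return None
    # Start the embedded JVM; this is required before any Lucene class is used
    # and fails if the Java paths are not set up correctly.
    lucene.initVM()
    return lucene.VERSION
```

A return value of `None` suggests the build or install step did not complete; an error raised by `initVM()` usually points at the Java configuration rather than at PyLucene itself.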
Topic Modeling
Topic modeling packages can be installed from the Gensim website.
Development Environment Setup
Follow the steps below, in the order given, to set up the development environment. The executables can be found in the software folder (Coming Soon).
- Install wxPython, pywin32, Py2exe
- Install wxFormBuilder
- Delete the files boot_common.py and boot_common.pyc from C:\Python27\Lib\site-packages\py2exe. Copy boot_common.py from the software folder into that directory, then compile it by running the following code in the Python CLI from that directory
import py_compile
py_compile.compile('boot_common.py')