kashenfelter/emi

 
 

The EMI project

Overview

This project exists to profile Python code used in a Natural Language Processing project.

Organization

The directory structure is as follows:

emi/
+-- build
+-- data
+-- dependencies
+-- dist
+-- emi.egg-info
+-- profiling
+-- README
+-- runProf.sh
+-- runTests.sh
+-- setup.py
+-- src
+-- test
+-- todo.org

with the major components simply being the src/, test/, and profiling/ directories. With test/, I will attempt to follow, at least minimally, a test-driven development style, e.g. writing a failing test, writing the minimal necessary code to fix the failing test, then moving on.
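As a rough illustration of that test-first loop, here is a minimal sketch. The count_skipgrams function below is a hypothetical stand-in written for this example (it is not the code in src/count_skipgrams.py), and its max_skip parameter is an assumption. The idea is that the two test methods are written first and fail, and the function body is then filled in with just enough code to make them pass.

```python
import unittest
from collections import Counter


def count_skipgrams(tokens, max_skip=1):
    """Count ordered token pairs separated by at most `max_skip` tokens.

    Hypothetical stand-in for illustration only; not the project's function.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair `left` with each of the next (max_skip + 1) tokens.
        for right in tokens[i + 1 : i + 2 + max_skip]:
            counts[(left, right)] += 1
    return counts


class CountSkipgramsTest(unittest.TestCase):
    def test_adjacent_pairs_only(self):
        counts = count_skipgrams(["a", "b", "c"], max_skip=0)
        self.assertEqual(counts, {("a", "b"): 1, ("b", "c"): 1})

    def test_one_skipped_token(self):
        counts = count_skipgrams(["a", "b", "c"], max_skip=1)
        self.assertEqual(counts[("a", "c")], 1)


if __name__ == "__main__":
    # Run the suite without calling sys.exit(), unlike unittest.main().
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountSkipgramsTest)
    unittest.TextTestRunner().run(suite)
```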

For profiling/, work has gone into investigating how we should best profile code for performance. To that end, I have included basic support for the cProfile library, which ships with Python's standard library. I have also included two third-party libraries, line_profiler and memory_profiler, which provide finer-grained, line-by-line information about the runtime behavior and memory usage of a given program.
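For orientation, a minimal sketch of the cProfile side of this, using only the standard library. The count_pairs workload and the profile_call helper are hypothetical names invented for this example; they are not taken from the profiling/ directory.

```python
import cProfile
import io
import pstats


def count_pairs(tokens):
    """Hypothetical workload: count adjacent token pairs."""
    counts = {}
    for left, right in zip(tokens, tokens[1:]):
        counts[(left, right)] = counts.get((left, right), 0) + 1
    return counts


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ran".split()
    result, report = profile_call(count_pairs, tokens)
    print(report)
```

cProfile reports per-function totals. line_profiler and memory_profiler instead report per line, for functions marked with their profile decorator.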

Setup

Treating the emi project as a module, in good Python fashion, means including a setup.py script in the root. Practically, this means that subdirectories (with __init__.py files) can refer to each other without our touching the PYTHONPATH variable, which is how we would otherwise tell Python where to find our libraries. This, however, is more of a side effect of the overall setup.py philosophy, which looks further ahead to deployment and shipping logistics. Thus the main output of running setup.py is the creation and population of the build/, dist/, and emi.egg-info/ directories, which make the root look busier than it really is.

Although we've pushed a version of the emi project that has already been "set up", you may periodically refresh the state of the project by running the following from the root, as per any PyPI package:

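$ python setup.py build
$ python setup.py develop

(This list formerly also included "$ python setup.py install"; use "develop" instead, for the reason given in the note that follows.)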

<2017-06-12> I breezed over this when reading up on setup.py, but using the "install" command ends up preventing you from importing modified versions of the installed directories. For example, I could not make a change to src/count_skipgrams.py, import it in profiling/main.py, and see the appropriate change. The likely cause is that "install" copies the package into site-packages, so imports resolve to that copy rather than to the working tree. You could reinstall every time you make a change, but this is clearly not ideal.

I searched around for "updating module" and "reloading module" and "clearing sys module path," which led me to the apparently more fully-featured "importlib" module, but the following lines were still useless:

import importlib
importlib.invalidate_caches()
skip = importlib.import_module("src.count_skipgrams")
skip = importlib.reload(skip)

So I decided to just look at setup.py again, where I promptly found that running

$ python setup.py develop

allows imported modules to reflect changes in their source code, exactly as we want. We can get craftier about "installing" and "developing" in the future.
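One quick way to check which copy of a module Python will actually pick up (the working tree or an installed copy) is to ask the import system for the module's origin. This is a minimal sketch: module_origin is a hypothetical helper name, and the standard-library json module is used in the demo only so the snippet runs anywhere. In this project you would pass something like "src.count_skipgrams" instead.

```python
import importlib.util


def module_origin(name):
    """Return the file path Python would import `name` from, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name cannot be found.
        return None
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # A path under site-packages suggests an installed copy is being used;
    # a path inside the repository suggests the working tree.
    print(module_origin("json"))
```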

sigh

Usage

At the outset, or after making any changes to the project, you should run the following:

$ runTests.sh

which will hopefully tell you if you broke anything. Test support is currently flimsy and more demonstrative than useful, that is, there's very low coverage.

To track how well the program is running, you will want to make use of the runProf.sh script. This has been written as a small unix utility, accepting a few command line arguments (choose which profiler, which functions to profile). It simply passes those arguments to a python script, which is written largely identically to the test script.
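To make the shape of that interface concrete, here is a minimal sketch of how the Python side might parse such arguments. Everything in it is an assumption for illustration: the option names, the profiler labels, and the parse_args helper are hypothetical and do not describe the actual runProf.sh interface.

```python
import argparse

# Hypothetical labels for the three profilers mentioned in this README.
PROFILERS = ("cprofile", "line", "memory")


def parse_args(argv=None):
    """Parse a profiler choice and a list of function names to profile."""
    parser = argparse.ArgumentParser(description="Profile selected functions.")
    parser.add_argument(
        "-p", "--profiler", choices=PROFILERS, default="cprofile",
        help="which profiler to run (default: cprofile)",
    )
    parser.add_argument(
        "functions", nargs="*",
        help="names of the functions to profile (default: none given)",
    )
    # argv=None makes argparse read sys.argv[1:], as a real script would.
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Demo with an explicit argument list; a real script would call parse_args().
    args = parse_args(["-p", "line", "count_skipgrams"])
    print(args.profiler, args.functions)
```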

Still working out some kinks.

Overall, we will probably home in on a single use-case for the profiling, and it could be that most of the intended features are dropped in favor of a simpler but more direct profiling methodology.

Miscellaneous

I keep track of goals and progress in the todo.org file kept in the root directory. Org-mode is a markup language built into emacs that offers support for formatting book-keeping files, such as to-do lists. In emacs, this means there's a lot of interactivity that comes out of the box, e.g. expanding and collapsing lists and headings, moving around the file like it's a directory editor, etc., all of which gets lost in any other text editor, or even in an older version of emacs.

A moment of silence for those people not using emacs.

If you don't care about those facets of the project (which you probably don't), then feel free to ignore it.

Contact Info

About

Empirical studies of mutual information in text

