forked from piantado/ngrampy
AlanKavanagh/ngrampy
ngrampy is a Python library for manipulating Google (or similarly formatted) n-gram data. It provides a Python class for very basic table manipulations in which operations on tables are mimicked by operations on the hard drive, so it can work with huge n-gram files that cannot be read into RAM. This takes a lot of hard drive time, but it can handle arbitrary file sizes (5-20 GB is typical). It is *not* optimized for speed, since these jobs take a long time anyway and are typically run once.

Usually, it makes more sense to process the Google files once, concatenating them and collapsing by some dates into one large file with all the n-grams (since this may take a few days). For this, the process-google.py script is fastest (much faster than LineFile). In collapsing dates, it makes a much smaller file (~10 GB for eng-us 2-grams):

	gzip -dc /home/piantado/Desktop/GoogleBooks/eng-us-all/2/* | python process-google.py /tmp/G-eng-us-all

Alternatively, unpigz is about 2x as fast as gzip on my computer (it multithreads fetching, file I/O, etc.).

This script does not do any fancy filtering of the n-grams.

To download data from Google, you can use download.py.

NOTE: In general, you should use this library with

	export PYTHONIOENCODING=utf-8

so that you can handle UTF-8 characters from Google.

NOTE: This splits columns in the text files by whitespace; if you want something else, you should merge with underscores or something similar.

NOTE: pypy tends to run much faster than python for this!
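To make the "collapsing by dates" step concrete, here is a minimal Python 3 sketch of the idea. It is not the code in process-google.py, whose internals are not described here. It assumes tab-separated input lines of the form ngram, year, match count, followed by any further columns, and it assumes that all lines for a given n-gram are adjacent, so only one n-gram is held in memory at a time. The names `parse` and `collapse_years` and the `min_year` parameter are hypothetical.

```python
from itertools import groupby


def parse(line):
    """Split one tab-separated line into (ngram, year, match_count)."""
    fields = line.rstrip("\n").split("\t")
    # Assumed column layout: ngram, year, match_count, then anything else.
    return fields[0], int(fields[1]), int(fields[2])


def collapse_years(lines, min_year=None):
    """Yield (ngram, total_count), summing match counts over years.

    Relies on the lines for each n-gram being adjacent in the input,
    so the whole file never has to fit in RAM.
    """
    rows = (parse(line) for line in lines if line.strip())
    for ngram, group in groupby(rows, key=lambda row: row[0]):
        total = sum(count for _, year, count in group
                    if min_year is None or year >= min_year)
        if total > 0:
            yield ngram, total


# Small in-memory sample; a real run would pass a file object or sys.stdin.
sample = [
    "a b\t1990\t3\t2\n",
    "a b\t1991\t4\t1\n",
    "c d\t1990\t5\t5\n",
]
print(list(collapse_years(sample)))  # [('a b', 7), ('c d', 5)]
```

Because `collapse_years` accepts any iterable of lines, it could be fed from a decompression pipe in the same way as the command above, with `sys.stdin` in place of `sample`.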
========================================================
== LICENSE
========================================================

ngrampy is licensed under GPL 3.0

========================================================
== INSTALLATION:
========================================================

Put this library somewhere; mine lives in

	/home/piantado/mit/Libraries/ngrampy/

Set the PYTHONPATH environment variable to point to ngrampy/:

	export PYTHONPATH=$PYTHONPATH:/home/piantado/Desktop/mit/Libraries/ngrampy

You can put this into your .bashrc file so that it is set automatically when you open a terminal. On Ubuntu and most Linux distributions, this is:

	echo 'export PYTHONPATH=$PYTHONPATH:/home/piantado/Desktop/mit/Libraries/ngrampy' >> ~/.bashrc

You can also do

	echo 'export PYTHONIOENCODING=utf-8' >> ~/.bashrc

although this will change your default Python I/O encoding.

You should then be ready to use the library.
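One way to confirm that the environment variables took effect is a short check from Python. This is a sketch for Python 3 using only the standard library; the function name `check_setup` is hypothetical, and the module name `ngrampy` is an assumption about how the library is laid out on your PYTHONPATH, so adjust it if your checkout differs.

```python
import importlib.util
import os
import sys


def check_setup(module_name="ngrampy"):
    """Report whether a module is importable and which I/O encoding is active."""
    return {
        # find_spec returns None when a top-level module cannot be located.
        "module_found": importlib.util.find_spec(module_name) is not None,
        # None here means the variable was not exported in this shell.
        "pythonioencoding": os.environ.get("PYTHONIOENCODING"),
        "stdout_encoding": sys.stdout.encoding,
    }


if __name__ == "__main__":
    for key, value in check_setup().items():
        print(f"{key}: {value}")
```

If `module_found` is False, open a new terminal (so that .bashrc is re-read) and check that the path added to PYTHONPATH matches where the library actually lives.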
About
Tools in python for dealing with Google Books Ngram files and other similar data sets.