ForumMiner

Repository: xiang7/ForumMiner

HOW TO INSTALL

Install the following if you haven't installed them already.

Note: on Windows, ForumMiner currently works only on 32-bit architectures.

  1. Prerequisites:

    • Python: https://www.python.org/download/

      To test, type 'python' in a command line. If the interactive Python interpreter starts, it is installed. Otherwise, install it from the URL above. On Windows, download Python 2.6 and remember to set the Path variable so that 'python' works in cmd; see https://docs.python.org/2/using/windows.html.

    • C environment: On Linux, see http://gcc.gnu.org/install/. To test, type 'gcc' in a command line. If the program asks for an input file, it is installed. Otherwise, install it from the URL above.

      On Windows, if VC++ 2008 or VS 2008 is installed, skip this step. Otherwise, install Visual C++ 2008 Express (http://go.microsoft.com/?linkid=7729279) and the Windows SDK.

  2. Python setuptools

  3. Python-dev

    • Linux - sudo apt-get install python-dev
    • Windows - skip this step
  4. Run the following:

    • Linux - sudo python setup.py install
    • Windows - python setup_win.py install
  5. Install the NLTK data (this will take a while):

    • python -m nltk.downloader all (prefix with sudo on Linux)
  6. Install NumPy.

  7. Install esmre (Windows only):

    • http://code.google.com/p/esmre/downloads/list

HOW TO TEST INSTALLATION

  • Linux - sudo python setup.py test
  • Windows - python tests.py

If 'OK' appears in the last row of the output, the installation is complete. Otherwise, read the error message and install whatever is missing.

HOW TO RUN CODE

Note: for detailed instructions on running a .py file (such as WLZW.py), run

python WLZW.py -h

There are two ways to use the code:

  • As a library; see html/index.html for the interface reference.
  • As command-line programs; the following sections give examples.

#### WLZW - Extract frequent patterns from a corpus

python WLZW.py -i corpus -np 4 -o patterns

From a file named 'corpus', where each line is a document, extract frequent patterns (ngrams) using 4 processes. The output is written to a file named 'patterns', where each line is a pattern (ngram).
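WLZW's exact algorithm isn't described here, but the idea of extracting frequent n-gram patterns from a line-per-document corpus can be sketched in plain Python (the function name, `max_n`, and the `min_count` threshold below are illustrative, not ForumMiner's API):

```python
from collections import Counter

def extract_ngrams(documents, max_n=3, min_count=2):
    """Count all word n-grams up to max_n and keep the frequent ones."""
    counts = Counter()
    for doc in documents:
        words = doc.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # A pattern counts as 'frequent' if it occurs at least min_count times
    return {ng for ng, c in counts.items() if c >= min_count}

corpus = ["the quick brown fox", "the quick red fox"]
patterns = extract_ngrams(corpus, max_n=2, min_count=2)
# {'the', 'quick', 'fox', 'the quick'}
```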

For more details, see python WLZW.py -h

#### FreqEst - Count the frequency of patterns and compute statistics

python FreqEst.py -i corpus -l patterns -o entries

Use the same 'corpus' and 'patterns' files from the previous step as input; the program outputs the frequency of each pattern and the importance statistics (TF-IDF, MI, RIDF). The output is written to a file named 'entries'.
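FreqEst's exact formulas aren't reproduced here; as an illustration, a textbook TF-IDF score for a single-word pattern over a line-per-document corpus could be computed like this (a sketch, not FreqEst's implementation):

```python
import math

def tfidf(pattern, documents):
    """TF: total occurrences; IDF: log(N / number of documents containing it)."""
    tf = sum(doc.split().count(pattern) for doc in documents)
    df = sum(1 for doc in documents if pattern in doc.split())
    if df == 0:
        return 0.0
    return tf * math.log(len(documents) / df)

docs = ["the quick brown fox", "the lazy dog"]
score = tfidf("fox", docs)  # tf=1, df=1 -> 1 * log(2/1)
```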

For more details, see python FreqEst.py -h

#### SQLiteWrapper - Insert data into, or query data from, the SQLite database

python SQLiteWrapper.py -i entries

Use the file 'entries' from the previous step to insert the ngram entries into the SQLite database. You may then select some of the records from the database to verify correctness or for later use.

Select all entries, write into a file named 'selected':

python SQLiteWrapper.py -o selected

If only the patterns (ngrams) themselves are needed, rather than all the other statistics:

python SQLiteWrapper.py -o selected -ngram_only

It is also possible to select ngrams or ngram entries using criteria, e.g. selecting only ngrams whose mutual information (MI) lies between a lower and an upper bound:

python SQLiteWrapper.py -o selected -mi_lower 0.0 -mi_upper 1.0

Or even specify more restrictions:

python SQLiteWrapper.py -o selected -mi_lower 0.0 -mi_upper 1.0 -tf_lower 1 -tf_upper 10 -max_num 1000 -pos NN

The above command selects ngrams that have an MI between 0.0 and 1.0, a term frequency between 1 and 10, and a part-of-speech tag of NN. The maximum number of returned results is 1000, and the output is written to a file named 'selected'.
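Internally, a criteria query like the one above presumably translates into a parameterized SQL range query. A minimal sqlite3 sketch of that idea (the table and column names here are hypothetical, not ForumMiner's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ngrams (ngram TEXT, mi REAL, tf INTEGER, pos TEXT)")
conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?, ?)", [
    ("data mining", 0.5, 3, "NN"),
    ("the of", 2.5, 50, "DT"),
])
# Each range criterion becomes a WHERE clause; -max_num becomes LIMIT
rows = conn.execute(
    "SELECT ngram FROM ngrams"
    " WHERE mi BETWEEN ? AND ? AND tf BETWEEN ? AND ? AND pos = ?"
    " LIMIT ?",
    (0.0, 1.0, 1, 10, "NN", 1000),
).fetchall()  # [('data mining',)]
```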

For more details and the full list of criteria, see

python SQLiteWrapper.py -h

#### POS - Part-of-speech tagger

python POS.py -i corpus -l tag_list -s ' ' -o tagged_list -u -m

'corpus' is the file used to populate the DB (the one from which the ngrams are extracted). 'tag_list' is a file of ngrams to be tagged (any ngram not in the corpus, and thus not in the DB, is ignored). The tagged result is written to a file named 'tagged_list'. -u tells the program to update the DB with the POS tags. -m tells the program to output, and thus update, only the matched results (a built-in matcher recognizes useful POS patterns). The separator between document id and document content is specified by -s (a space ' ' in this case; it defaults to '$').
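The matcher behind -m is not documented here; one plausible sketch is a matcher that keeps only noun-phrase-like tag sequences (the pattern below is an assumption for illustration, not POS.py's actual rule set):

```python
import re

# Assumption: 'useful' patterns are noun-phrase-like, i.e. zero or more
# adjectives (JJ) followed by one or more nouns (NN)
NOUN_PHRASE = re.compile(r"^(JJ )*(NN )*NN$")

def is_useful(tag_sequence):
    """Return True if the POS tag sequence matches the noun-phrase pattern."""
    return NOUN_PHRASE.match(" ".join(tag_sequence)) is not None

is_useful(["JJ", "NN"])  # True: adjective + noun
is_useful(["DT", "NN"])  # False: determiner + noun
```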

The -l option can be replaced with -a, in which case no list of words to tag is needed; the program tags all ngrams in the DB. See the following example:

python POS.py -i corpus -a -s ' ' -o tagged_list -u -m

For complete details, see:

python POS.py -h

#### ClassTagger - Assign tags to patterns and tag the patterns in the corpus

python ClassTagger.py -i tagged_ngram

Insert tags for ngrams into the DB using the input file 'tagged_ngram'. The file uses the format 'ngram tag' on each line. A different separator between ngram and tag can be specified; e.g. to use $$$$$ as the separator:

python ClassTagger.py -i tagged_ngram -s '$$$$$'
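The 'ngram tag' input format, including a multi-character separator, can be parsed with a right-split so that multi-word ngrams stay intact (a sketch, not ClassTagger's actual parser):

```python
def parse_tagged(lines, sep=" "):
    """Split each 'ngram<sep>tag' line into an (ngram, tag) pair.

    rsplit keeps multi-word ngrams whole when the separator is a space.
    """
    pairs = []
    for line in lines:
        ngram, tag = line.rstrip("\n").rsplit(sep, 1)
        pairs.append((ngram, tag))
    return pairs

pairs = parse_tagged(["data mining$$$$$topic"], sep="$$$$$")
# [('data mining', 'topic')]
```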

Another function is to use the tagged ngrams in the DB to tag new documents:

python ClassTagger.py -t file_to_tag -to file_tagged

This uses the tagged ngrams in the DB to tag the file named 'file_to_tag' and writes the output to a file named 'file_tagged'. The input file could be the corpus itself.
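Conceptually, tagging a document with the stored ngrams amounts to annotating each occurrence of a known ngram; a naive sketch of that idea (not ClassTagger's implementation or output format):

```python
def tag_document(text, tagged_ngrams):
    """Annotate every occurrence of a known ngram with its tag."""
    for ngram, tag in tagged_ngrams:
        text = text.replace(ngram, "{}/{}".format(ngram, tag))
    return text

out = tag_document("we study data mining here", [("data mining", "topic")])
# 'we study data mining/topic here'
```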

For more details, see:

python ClassTagger.py -h

FILES

List of files in the directory

  • *.py: Python modules
  • html: code documentation
  • report: previous reports
  • config: config file for Doxygen (used to generate the code documentation)
