yuchenz/koalaNLP

koalaNLP

Useful NLP scripts

  • processCH.py

    Chinese data statistics and cleaning processor

  • preprocBPS.py

    preprocess data for the Berkeley Parser

  • line2tree.py

    convert parse trees in string format into tree format, written to standard output; the input is either a file of trees or a single tree read from standard input

  • mergeSort2F.py

    merge two sorted files into one sorted file

  • isLegalTree.py

    check if trees from stdin are legal trees that can be parsed by nltk.Tree

  • tree2latex.py

    given a file of parse trees (in Berkeley Parser tree format), output a .tex file that draws the trees in LaTeX

  • simplifyTree.py

    given a tree, reduce redundant levels (see code for explanations)

  • evalMalletMaxEnt.py

    evaluate the output of a Mallet MaxEnt model

  • tokenizer.perl

    the tokenizer from moses (use -a option)

  • lowercase.perl

    the lowercaser from moses
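
As an illustration of what mergeSort2F.py describes (merging two sorted files into one sorted file), here is a minimal sketch. It is not the script's actual code: the function names are hypothetical, and it assumes both inputs are sorted by plain string comparison of whole lines.

```python
import heapq


def _stripped(lines):
    """Yield each line without its trailing newline."""
    for line in lines:
        yield line.rstrip("\n")


def merge_sorted_lines(lines_a, lines_b):
    """Lazily merge two already-sorted iterables of lines.

    Assumption: both inputs are sorted by plain string comparison.
    heapq.merge never loads a whole input into memory.
    """
    return heapq.merge(_stripped(lines_a), _stripped(lines_b))


def merge_sorted_files(path_a, path_b, path_out):
    """Merge two sorted text files into one sorted output file."""
    with open(path_a) as fa, open(path_b) as fb, open(path_out, "w") as out:
        for line in merge_sorted_lines(fa, fb):
            out.write(line + "\n")
```

Stripping the newline before comparing means a final line without a trailing newline is still merged correctly.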

for MT

  • mt/parallelSents.py

    wholeFile and wholeParaFile are sentence-aligned parallel files, one sentence per line; partFile contains a subset of the sentences in wholeFile; the script selects the corresponding parallel sentences from wholeParaFile and writes them to partParaFile

  • mt/wa2latex.py

    given a foreign sentence file, an English sentence file, a gold word alignment file, and an automatic word alignment file, all sentence-aligned (one sentence per line), output a .tex file that draws a word alignment matrix in LaTeX, and output statistics about the alignments

    helper module mt/wa2latex_helper.py

  • mt/featureModifier.py

    modify features for rules in a Moses phrase table

  • parse tree binarization for MT (method 1):

    • mt/berkeleyparsed2mosesxml.perl (from moses)

      convert Berkeley-parsed trees into trees in Moses' XML format

    • mt/koala-relax-parse (from moses)

      binarize trees in moses' xml format

      (binarize options: --LeftBinarize, --RightBinarize, --SAMT 1-4)

    • mt/mosesxml2berkeleyparsed.py

      convert binarized trees in Moses' XML format into Berkeley format trees

      (only works on left- or right-binarized trees; does not work on SAMT-binarized trees)

    $ echo "( (S (NN I) (NN you) (NN he) (NN she)))" | berkeleyparsed2mosesxml.perl | koala-relax-parse --RightBinarize | mosesxml2berkeleyparsed.py

    (S (TOP (NN I) (^S (NN you) (^S (NN he) (NN she)))))

  • parse tree binarization for MT (method 2):

    • mt/treeBinarizer.py

      a rule-based parse tree binarizer

  • flatten parse trees for MT:

    • mt/flattenTree.py

      syntax: cat filename | flattenTree.py level

      filename is a file of parse trees, one tree per line; level is how much to flatten the trees (level >= 4)

  • mt/waMatrix_oneLine.py

    convert between the one-line representation and the matrix representation of the word alignment for a sentence pair
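
The selection that mt/parallelSents.py describes (given wholeFile, wholeParaFile and partFile, produce partParaFile) can be sketched as follows. This is a hedged illustration, not the script's actual code: the function name is hypothetical, and the handling of duplicate sentences (first occurrence wins) and the output order (same as partFile) are assumptions.

```python
def select_parallel(whole, whole_para, part):
    """Return the counterpart in whole_para for each sentence in part.

    whole and whole_para are lists of sentences, aligned line by line.
    part is a list of sentences taken from whole.
    Assumptions: if a sentence occurs more than once in whole, its first
    occurrence is used; results follow the order of part.
    """
    if len(whole) != len(whole_para):
        raise ValueError("whole and whole_para must have the same length")

    # Map each sentence to the index of its first occurrence in whole.
    first_index = {}
    for i, sentence in enumerate(whole):
        first_index.setdefault(sentence, i)

    selected = []
    for sentence in part:
        if sentence not in first_index:
            raise KeyError("sentence not found in whole: %r" % sentence)
        selected.append(whole_para[first_index[sentence]])
    return selected
```

To work on files, read each file with `[line.rstrip("\n") for line in f]` and write the returned list one sentence per line.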
