Author: Hai Hu (huhai at indiana.edu), forked and modified from https://github.com/lutzky/translationese
Original authors: Ohad Lutzky, Gal Star
Documentation of the original implementation is available here: https://translationese.readthedocs.org/
This is part of a project on the features of translated text, that is, what makes translations different from text written originally in the same language. The code here extracts features of translations for downstream text classification tasks. The original implementation is for English, replicating the results in this paper; the code here handles Chinese. The results are reported in a recently submitted paper (contact Hai Hu for a draft version). For a similar study on Chinese translations, see our previous paper.
- Use Stanford CoreNLP instead of NLTK. CoreNLP is more accurate and makes it easy to extend to other languages. You need to install and start a CoreNLP server locally before processing text data. You will also need a Python wrapper for CoreNLP, which is already in this folder; it is a modification of the wrapper from here.
- Port the code from Python 2 to Python 3.
- Add new translationese features (modules), some of which are specific to Chinese. The newly added modules include `ba_structure`, `bei_structure`, context-free grammar rules (`CFGRs.py`), `char_rank`, `chengyu`, etc. The original code by Lutzky and Star has the suffix `_old`.
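To give a sense of what these modules compute, here is a minimal sketch of a `ba_structure`-style feature. This is an illustration only, not the actual module: the real code works on CoreNLP output, while this toy version assumes you already have `(token, POS)` pairs with Penn Chinese Treebank tags, where 把 in a ba-construction is tagged `BA`.

```python
# Illustrative sketch only -- not the actual ba_structure module.
# Assumes Penn Chinese Treebank POS tags (BA = 把 in a ba-construction).

def ba_frequency(tagged_sentences):
    """Return the number of BA-tagged tokens per 1000 tokens."""
    n_tokens = sum(len(sent) for sent in tagged_sentences)
    n_ba = sum(1 for sent in tagged_sentences
               for _, tag in sent if tag == "BA")
    return 1000.0 * n_ba / n_tokens if n_tokens else 0.0

# Toy example: 他 把 书 看 完 了 (one ba-construction among 6 tokens)
sents = [[("他", "PN"), ("把", "BA"), ("书", "NN"),
          ("看", "VV"), ("完", "VV"), ("了", "AS")]]
print(ba_frequency(sents))  # 1 BA token out of 6, scaled to per-1000
```

Normalizing counts to a per-1000-token rate (rather than using raw counts) keeps the feature comparable across texts of different lengths, which is what a classifier needs.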
- Install Stanford CoreNLP. Note: our code works with version 3.9.1. You need to download two things, both from this link: 1) CoreNLP and 2) the Chinese model. Unzip the first file; you will get a folder `stanford-corenlp-full-2018-02-27`. Then put the second file (`stanford-chinese-corenlp-2018-02-27-models.jar`) in that folder.
- Install Java from this link. Use Java 8, 11, or 15.
- Start the CoreNLP server: open a terminal and go to the folder where CoreNLP is installed (if it is installed on the Desktop, type `cd /home/hai/Desktop/stanford-corenlp-full-2018-02-27/`). Then type:

  ```
  java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
  ```

  You should see:

  ```
  [main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
  ```
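Once the server is listening, you can also talk to it directly over HTTP, independently of the wrapper shipped in this folder. Below is a minimal standard-library sketch: the endpoint and the JSON `properties` parameter follow the public CoreNLP server API, but the port (9000, matching the command above), the annotator list, and the `pipelineLanguage` property are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

def build_url(host="localhost", port=9000,
              annotators="tokenize,ssplit,pos", lang="zh"):
    """Build a CoreNLP server URL with a JSON 'properties' parameter.

    The annotator list and language here are assumptions for this sketch.
    """
    props = {"annotators": annotators,
             "pipelineLanguage": lang,
             "outputFormat": "json"}
    return "http://%s:%d/?properties=%s" % (
        host, port, urllib.parse.quote(json.dumps(props)))

def annotate(text, url=None):
    """POST raw text to a running CoreNLP server, return parsed JSON."""
    req = urllib.request.Request(build_url() if url is None else url,
                                 data=text.encode("utf-8"))
    with urllib.request.urlopen(req, timeout=15) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires the server started as shown above):
#   annotate("他把书看完了。")["sentences"][0]["tokens"]
```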
- Download this repository and unzip it. Go to the folder, e.g., `cd /home/hai/Desktop/translationese/`. Now you can run the translationese modules.
- Type `python3 analyze.py -h` to see the help message. You should see:
  ```
  usage: analyze.py [-h] [-t T_DIR] [-o O_DIR] [--outfile OUTFILE] [-d DEST_DIR]
                    [-l LANG]
                    MODULE [MODULE ...]

  Run a translationese analysis of T_DIR and O_DIR, using MODULE. Output is in
  weka-compatible ARFF format. Specify 'ALL' as the module to automatically run
  all possible variants on all possible modules.

  positional arguments:
    MODULE             Available modules: CFGRs:0 CFGRs:1 CFGRs:2 CFGRs:3
                       CFGRs:4 CFGRs:5 CFGRs:6 CFGRs:7 average_pmi
                       ba_structure bei_structure character_n_grams:0
                       character_n_grams:1 character_n_grams:2 chengyu:0
                       chengyu:1 chengyu:2 cohesive_markers:0
                       cohesive_markers:1 cohesive_markers:2
                       cohesive_markers:3 cohesive_markers:4
                       contextual_function_words contractions depPronoun:0
                       depPronoun:1 depPronoun:2 depPronoun:3 depPronoun:4
                       depPronoun:5 depTriple:0 depTriple:1 depTriple:2
                       depTriple:3 depTriple:4 determiners:0 determiners:1
                       determiners:2 entropy:0 entropy:1 entropy:2 entropy:3
                       entropy:4 entropy:5 explicit_naming function_words
                       lexical_density lexical_variety:0 lexical_variety:1
                       lexical_variety:2 lexical_variety_char:0
                       lexical_variety_char:1 lexical_variety_char:2
                       mean_char_rank:0 mean_char_rank:1 mean_multiple_naming
                       mean_sentence_length mean_sentence_length_chars
                       mean_word_length mean_word_rank:0 mean_word_rank:1
                       most_frequent_chars:0 most_frequent_chars:1
                       most_frequent_chars:2 most_frequent_chars:3
                       most_frequent_words:0 most_frequent_words:1
                       most_frequent_words:2 normalization:0 normalization:1
                       normalization:2 normalization:3 pos_n_grams:0
                       pos_n_grams:1 pos_n_grams:2 pos_n_grams:3
                       positional_token_frequency pronouns punctuation:0
                       punctuation:1 punctuation:2 punctuation:3
                       ratio_to_passive_verbs repetitions:0 repetitions:1
                       single_naming subConj:0 subConj:1 subConj:2
                       syllable_ratio threshold_pmi treedepth word_n_grams:0
                       word_n_grams:1 word_n_grams:2 (or ALL)

  optional arguments:
    -h, --help         show this help message and exit
    -t T_DIR           Directory of T (translated) texts [default: ./t/]
    -o O_DIR           Directory of O (original) texts [default: ./o/]
    --outfile OUTFILE  Write output to OUTFILE.
    -d DEST_DIR, --dest-dir DEST_DIR
                       OUTFILE[s] will be created in DEST_DIR.
    -l LANG, --language LANG
                       set LANGUAGE of the text ('en' for English; 'zh' for
                       Chinese, default)

  Modules marked with a colon (:) indicate variants of the same module. OUTFILE
  is MODULE_NAME.arff (including variant, if present).
  ```
- To test a module, say POS unigrams, use:

  ```
  python3 analyze.py pos_n_grams:0 -d myoutput/ -o o/ -t t/
  ```

  This will process the original texts in the directory `o/` and the translated texts in `t/`, and save the output in the directory `myoutput/`. One test text file is provided for each of `o` and `t`. You can then use the file `zh_pos_n_grams:0.arff` to run the classifier in WEKA.

  Note: the test file for `o` is from Xinhua; the test file for `t` is from Reference News.
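The ARFF files written by `analyze.py` are plain text, so you can also inspect them outside WEKA. Here is a minimal sketch of a reader for the subset of ARFF these files use (`@relation`, `@attribute`, and a comma-separated `@data` section); the sample values below are made up for illustration, and this is not a full ARFF parser (it ignores quoting and sparse data).

```python
def read_arff(text):
    """Parse a simple ARFF string into (attribute_names, data_rows)."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            attributes.append(line.split()[1])  # second field is the name
        elif lower.startswith("@data"):
            in_data = True                      # everything after is data
        elif in_data:
            rows.append(line.split(","))
    return attributes, rows

# Toy file in the shape of analyze.py's output (values are made up):
sample = """@relation pos_n_grams
@attribute NN numeric
@attribute class {O,T}
@data
0.12,O
0.34,T
"""
names, rows = read_arff(sample)
print(names)  # ['NN', 'class']
print(rows)   # [['0.12', 'O'], ['0.34', 'T']]
```

The last attribute is the class label (`O` for original, `T` for translated), which is what WEKA trains the classifier to predict.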
See the original documentation for more details.
If you use our code or work, please cite our paper on Chinese translationese:
```
@article{huinvestigating,
  title={Investigating translated Chinese and its variants using machine learning},
  author={Hu, Hai and K{\"u}bler, Sandra},
  journal={Natural Language Engineering},
  pages={1--34},
  publisher={Cambridge University Press},
  DOI={10.1017/S1351324920000182},
  year={2020}
}
```
Please also cite the original implementation:
```
@misc{translationese,
  author = {Ohad Lutzky and Gal Star},
  title  = {Detecting translationese},
  note   = {Online at \url{https://github.com/lutzky/translationese}; retrieved August 2020},
  year   = {2013}
}
```