Author: Hai Hu (huhai at indiana.edu), forked and modified from https://github.com/lutzky/translationese
Original authors: Ohad Lutzky, Gal Star
Documentation of the original implementation is available here: https://translationese.readthedocs.org/
This is part of a project on the features of translated text, that is, what makes translations different from text written originally in the same language. The code here extracts features of translations for downstream text classification tasks. The original implementation is for English, replicating the results in this paper; the code here handles Chinese. The results are reported in a recently submitted paper (contact Hai Hu for a draft version). For a similar study on Chinese translations, see our previous paper.
- Use Stanford CoreNLP instead of NLTK. CoreNLP is more accurate and makes it easy to extend to other languages. You need to install and start a CoreNLP server locally before processing text data. You will also need a Python wrapper for CoreNLP, which is already in this folder; it is a modification of the wrapper from here.
- Port the code from Python 2 to Python 3.
- Add new translationese features (modules), some of which are specific to Chinese. The newly added modules include `ba_structure`, `bei_structure`, context-free grammar rules (`CFGRs.py`), `char_rank`, `chengyu`, etc. The original code by Lutzky and Star has the suffix `_old`.
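To give a sense of what these modules compute, here is a minimal sketch of a `ba_structure`-style feature. This is an illustration only, not the actual module: the real code works on CoreNLP output, while this toy version assumes you already have `(token, POS)` pairs with Penn Chinese Treebank tags, where 把 in a ba-construction is tagged `BA`.

```python
# Illustrative sketch only -- not the actual ba_structure module.
# Assumes Penn Chinese Treebank POS tags (BA = 把 in a ba-construction).

def ba_frequency(tagged_sentences):
    """Return the number of BA-tagged tokens per 1000 tokens."""
    n_tokens = sum(len(sent) for sent in tagged_sentences)
    n_ba = sum(1 for sent in tagged_sentences
               for _, tag in sent if tag == "BA")
    return 1000.0 * n_ba / n_tokens if n_tokens else 0.0

# Toy example: 他 把 书 看 完 了 (one ba-construction among 6 tokens)
sents = [[("他", "PN"), ("把", "BA"), ("书", "NN"),
          ("看", "VV"), ("完", "VV"), ("了", "AS")]]
print(ba_frequency(sents))  # 1 BA token out of 6, scaled to per-1000
```

Normalizing counts to a per-1000-token rate (rather than using raw counts) keeps the feature comparable across texts of different lengths, which is what a classifier needs.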
- Install Stanford CoreNLP. Note: our code works with version 3.9.1. You need to download two things, both from this link: 1) CoreNLP and 2) the Chinese model. Unzip the first file; you will get a folder `stanford-corenlp-full-2018-02-27`. Then put the second file (`stanford-chinese-corenlp-2018-02-27-models.jar`) in that folder.
- Install Java from this link. Use Java 8, 11, or 15.
- Start the CoreNLP server: open a terminal and go to the folder where CoreNLP is installed (if it is installed on the Desktop, type `cd /home/hai/Desktop/stanford-corenlp-full-2018-02-27/`). Then type:

  ```
  java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
  ```

  You should see:

  ```
  [main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
  ```
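Once the server is listening, you can also talk to it directly over HTTP, independently of the wrapper shipped in this folder. Below is a minimal standard-library sketch: the endpoint and the JSON `properties` parameter follow the public CoreNLP server API, but the port (9000, matching the command above), the annotator list, and the `pipelineLanguage` property are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

def build_url(host="localhost", port=9000,
              annotators="tokenize,ssplit,pos", lang="zh"):
    """Build a CoreNLP server URL with a JSON 'properties' parameter.

    The annotator list and language here are assumptions for this sketch.
    """
    props = {"annotators": annotators,
             "pipelineLanguage": lang,
             "outputFormat": "json"}
    return "http://%s:%d/?properties=%s" % (
        host, port, urllib.parse.quote(json.dumps(props)))

def annotate(text, url=None):
    """POST raw text to a running CoreNLP server, return parsed JSON."""
    req = urllib.request.Request(build_url() if url is None else url,
                                 data=text.encode("utf-8"))
    with urllib.request.urlopen(req, timeout=15) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires the server started as shown above):
#   annotate("他把书看完了。")["sentences"][0]["tokens"]
```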
- Download this repository and unzip it. Go to the folder, e.g., `cd /home/hai/Desktop/translationese/`. Now you can run the translationese modules.
- Type `python3 analyze.py -h` to see the help message. You should see:
  ```
  usage: analyze.py [-h] [-t T_DIR] [-o O_DIR] [--outfile OUTFILE] [-d DEST_DIR]
                    [-l LANG]
                    MODULE [MODULE ...]

  Run a translationese analysis of T_DIR and O_DIR, using MODULE. Output is in
  weka-compatible ARFF format. Specify 'ALL' as the module to automatically run
  all possible variants on all possible modules.

  positional arguments:
    MODULE             Available modules: CFGRs:0 CFGRs:1 CFGRs:2 CFGRs:3
                       CFGRs:4 CFGRs:5 CFGRs:6 CFGRs:7 average_pmi
                       ba_structure bei_structure character_n_grams:0
                       character_n_grams:1 character_n_grams:2 chengyu:0
                       chengyu:1 chengyu:2 cohesive_markers:0
                       cohesive_markers:1 cohesive_markers:2
                       cohesive_markers:3 cohesive_markers:4
                       contextual_function_words contractions depPronoun:0
                       depPronoun:1 depPronoun:2 depPronoun:3 depPronoun:4
                       depPronoun:5 depTriple:0 depTriple:1 depTriple:2
                       depTriple:3 depTriple:4 determiners:0 determiners:1
                       determiners:2 entropy:0 entropy:1 entropy:2 entropy:3
                       entropy:4 entropy:5 explicit_naming function_words
                       lexical_density lexical_variety:0 lexical_variety:1
                       lexical_variety:2 lexical_variety_char:0
                       lexical_variety_char:1 lexical_variety_char:2
                       mean_char_rank:0 mean_char_rank:1 mean_multiple_naming
                       mean_sentence_length mean_sentence_length_chars
                       mean_word_length mean_word_rank:0 mean_word_rank:1
                       most_frequent_chars:0 most_frequent_chars:1
                       most_frequent_chars:2 most_frequent_chars:3
                       most_frequent_words:0 most_frequent_words:1
                       most_frequent_words:2 normalization:0 normalization:1
                       normalization:2 normalization:3 pos_n_grams:0
                       pos_n_grams:1 pos_n_grams:2 pos_n_grams:3
                       positional_token_frequency pronouns punctuation:0
                       punctuation:1 punctuation:2 punctuation:3
                       ratio_to_passive_verbs repetitions:0 repetitions:1
                       single_naming subConj:0 subConj:1 subConj:2
                       syllable_ratio threshold_pmi treedepth word_n_grams:0
                       word_n_grams:1 word_n_grams:2 (or ALL)

  optional arguments:
    -h, --help         show this help message and exit
    -t T_DIR           Directory of T (translated) texts [default: ./t/]
    -o O_DIR           Directory of O (original) texts [default: ./o/]
    --outfile OUTFILE  Write output to OUTFILE.
    -d DEST_DIR, --dest-dir DEST_DIR
                       OUTFILE[s] will be created in DEST_DIR.
    -l LANG, --language LANG
                       set LANGUAGE of the text ('en' for English; 'zh' for
                       Chinese, default)

  Modules marked with a colon (:) indicate variants of the same module. OUTFILE
  is MODULE_NAME.arff (including variant, if present).
  ```
- To test a module, say POS unigrams, use:

  ```
  python3 analyze.py pos_n_grams:0 -d myoutput/ -o o/ -t t/
  ```

  This will process the original texts in the directory `o/` and the translated texts in `t/`, and save the output in the directory `myoutput/`. One test text file is provided for each of `o` and `t`. You can then use the file `zh_pos_n_grams:0.arff` to run the classifier in WEKA.

  Note: the test file for `o` is from Xinhua; the test file for `t` is from Reference News.
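The ARFF files written by `analyze.py` are plain text, so you can also inspect them outside WEKA. Here is a minimal sketch of a reader for the subset of ARFF these files use (`@relation`, `@attribute`, and a comma-separated `@data` section); the sample values below are made up for illustration, and this is not a full ARFF parser (it ignores quoting and sparse data).

```python
def read_arff(text):
    """Parse a simple ARFF string into (attribute_names, data_rows)."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            attributes.append(line.split()[1])  # second field is the name
        elif lower.startswith("@data"):
            in_data = True                      # everything after is data
        elif in_data:
            rows.append(line.split(","))
    return attributes, rows

# Toy file in the shape of analyze.py's output (values are made up):
sample = """@relation pos_n_grams
@attribute NN numeric
@attribute class {O,T}
@data
0.12,O
0.34,T
"""
names, rows = read_arff(sample)
print(names)  # ['NN', 'class']
print(rows)   # [['0.12', 'O'], ['0.34', 'T']]
```

The last attribute is the class label (`O` for original, `T` for translated), which is what WEKA trains the classifier to predict.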
See the original documentation for more details.
If you use our code or work, please cite our paper on Chinese translationese:
```
@article{huinvestigating,
  title={Investigating translated Chinese and its variants using machine learning},
  author={Hu, Hai and K{\"u}bler, Sandra},
  journal={Natural Language Engineering},
  pages={1--34},
  publisher={Cambridge University Press},
  DOI={10.1017/S1351324920000182},
  year={2020}
}
```
Please also cite the original implementation:
```
@misc{translationese,
  author = {Ohad Lutzky and Gal Star},
  title  = {Detecting translationese},
  note   = {Online at \url{https://github.com/lutzky/translationese}; retrieved August 2020},
  year   = {2013}
}
```