The latest version of TM Cleaner (under TMCleaner-MMT-API) is now integrated with the latest version of MMT (1.0.2) [https://github.com/ModernMT/MMT/releases]. This means that word alignment and tokenization are delegated to MMT. As before, TMCleaner-MMT-API can be used standalone or as a web service.
Train mode standalone
python generateFeaturesAndClassify.py --features --config Parameters/Fastalign/p-Training-Italian.txt
Test mode standalone
python generateFeaturesAndClassify.py --classify --config Parameters/Fastalign/p-Test-Italian.txt --mlalgorithm LogisticRegression
Web Service
Launch web service
python classifyOneByOne-Server.py --config Parameters/Fastalign/p-Test-Italian.txt --port 9090
Query the web service (64.81.80.214 is the machine where the service runs, 9090 is the port, and "filetest.txt" is a test file with source and target segments):
python clientSegments-json.py 64.81.80.214 9090 filetest.txt
TM Cleaner can now be used as a web service. Read the Web Service Documentation to see how.
TM Cleaner is software for identifying incorrect translation units (TUs), that is, units whose segments are not translations of each other, in translation memories or parallel corpora. The identification of these TUs is framed as a classification task: the software returns “1” if it judges the TU to be correct and “0” otherwise. TM Cleaner can work with any classification algorithm implemented in scikit-learn, as long as a connection to the algorithm is supplied. TM Cleaner needs training data to create its models. TM Cleaner works in three mutually exclusive modalities:
- Bing. It uses the Bing translation engine to translate the source segment and then computes a similarity measure between the target segment and the translation of the source.
- Hunalign. It uses the Hunalign alignment score as a feature.
- Fastalign. It computes various features based on the word alignments between the source and target segments.
TM Cleaner lets you train a model, classify new data, and evaluate the classifier's performance. TM Cleaner has been tested on Linux and Mac OS X for the Bing and Hunalign modalities. In the Fastalign modality it works only on Ubuntu Linux because it relies on the MMT (Modern Machine Translation) distribution, which only compiles on Linux.
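To make the idea of an alignment-based feature concrete, here is a minimal sketch. It is not TM Cleaner's actual feature set, which is described in the Fastalign documentation. It assumes the alignment is given in the `i-j` (Pharaoh) format that word aligners such as fast_align produce, and the function name `alignment_coverage` is hypothetical.

```python
def alignment_coverage(alignment, source_length, target_length):
    """Return the share of source and target tokens that have at least one link.

    `alignment` is a string of "i-j" pairs (source index, target index).
    This feature is an illustration only, not one taken from TM Cleaner.
    """
    links = [tuple(int(i) for i in pair.split("-")) for pair in alignment.split()]
    aligned_source = set(s for s, _ in links)
    aligned_target = set(t for _, t in links)
    source_cov = len(aligned_source) / float(source_length) if source_length else 0.0
    target_cov = len(aligned_target) / float(target_length) if target_length else 0.0
    return source_cov, target_cov


# Three tokens on each side, every token aligned (hypothetical alignment).
coverage = alignment_coverage("0-0 1-1 2-2", 3, 3)
# Four source tokens and two target tokens with a single link.
partial = alignment_coverage("0-0", 4, 2)
```

Low coverage on either side suggests that the two segments share little content, which is the kind of signal a classifier can learn from.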
- Train
python generateFeaturesAndClassify.py --features --config parameterFile.txt
- Classify
python generateFeaturesAndClassify.py --classify --config parameterFile.txt
- Evaluate
python EvaluateExamples.py --classified Evaluation/about-classified.txt --annotated Evaluation/about-manual.txt
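The metrics reported by EvaluateExamples.py are not listed here. As a rough sketch of what comparing classified labels against manual annotations involves, the following computes accuracy, precision, recall, and F1 from two label lists. The function name `evaluate` and the choice of “0” (incorrect TU) as the positive class are assumptions made for illustration.

```python
def evaluate(predicted, gold, positive=0):
    """Compare predicted labels with gold labels for one positive class."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == positive and g == positive)
    fp = sum(1 for p, g in pairs if p == positive and g != positive)
    fn = sum(1 for p, g in pairs if p != positive and g == positive)
    correct = sum(1 for p, g in pairs if p == g)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / float(len(pairs)) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Four TUs: the classifier wrongly rejects the third one.
scores = evaluate([1, 0, 0, 1], [1, 0, 1, 1])
```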
For details, read the documentation corresponding to the three modalities.
Check out the Python source code from this GitHub repository. The best way to do this is by cloning it.
The software you need before running TM Cleaner is:
- Python 2.7.x or higher.
- Java 8 or higher.
- Scikit-learn 0.17.x or higher.
- To work with the Bing translation engine you need an application identifier provided by Bing. For details, see the Bing Documentation tutorial in the Documentation folder.
- To work with the Hunalign sentence aligner you need to download and install Hunalign. For details, see the Hunalign Documentation tutorial in the Documentation folder.
- To work with the Fastalign word aligner you need to download and install the MMT (Modern Machine Translation) Java application. For details, see the Fastalign Documentation tutorial in the Documentation folder.
For language identification we use the Cybozu [https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md] language identifier. The language identifier is run through a Java program that is called from the main script. The language profiles used by Cybozu are bundled with this distribution.
In the Fastalign modality we use a Java program that merges the backward and forward alignments. Like the language identifier, this program is called from the main Python script and is provided with this distribution.
All input files should be placed in a single directory and encoded in UTF-8. Each file consists of lines terminated by the end-of-line character (“\n”). Each line must contain the following mandatory fields, separated by “@#@”:
- Identifier_1 = a line counter that starts at 0 and is incremented for each line
- Identifier_2 = in the example this field is a copy of the previous one, but you can put any information you want here (for example, a database id)
- Source Segment = the source segment
- Target Segment = the target segment
- In training files, the category (0 or 1) must be added as the last field.
Example of a small file for the English-Italian language pair:
0@#@0@#@Epistle of Jude@#@Lettera di Giuda
1@#@1@#@Tel: +351-21 000 86 00 Romania Novartis Animal Health d. o. o.@#@Tel: +351-21 000 86 00
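The input format can be read with a few lines of Python. The sketch below is not part of TM Cleaner; the names `TranslationUnit` and `parse_input_line` are hypothetical, and it assumes that a line with four fields is unlabeled and a line with five fields is a training line.

```python
from collections import namedtuple

SEPARATOR = "@#@"

# Hypothetical record type; the field names are illustrative only.
TranslationUnit = namedtuple(
    "TranslationUnit", ["id_1", "id_2", "source", "target", "category"]
)


def parse_input_line(line):
    """Split one input line into its fields.

    `category` is None for unlabeled lines and 0 or 1 for training lines.
    """
    fields = line.rstrip("\n").split(SEPARATOR)
    if len(fields) == 4:
        id_1, id_2, source, target = fields
        category = None
    elif len(fields) == 5:
        id_1, id_2, source, target, label = fields
        category = int(label)
        if category not in (0, 1):
            raise ValueError("category must be 0 or 1, got %r" % label)
    else:
        raise ValueError("expected 4 or 5 fields, got %d" % len(fields))
    return TranslationUnit(id_1, id_2, source, target, category)


unit = parse_input_line("0@#@0@#@Epistle of Jude@#@Lettera di Giuda\n")
labelled = parse_input_line("0@#@0@#@Epistle of Jude@#@Lettera di Giuda@#@1\n")
```

Rejecting lines with an unexpected number of fields catches segments that themselves contain the separator, which would otherwise shift the fields silently.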
The output will be created in a directory called “Classified” inside the input directory.
WARNING: At the next run the directory Classified will be deleted and recreated. Therefore, after a run, move the classified files somewhere else.
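One way to follow this advice is to move the directory to a timestamped location after each run. The helper below is a sketch under that assumption; `archive_classified` and the archive layout are hypothetical and not part of TM Cleaner.

```python
import os
import shutil
import time


def archive_classified(input_dir, archive_root):
    """Move <input_dir>/Classified into archive_root under a timestamped name.

    Returns the new path, or None if there is no Classified directory.
    """
    classified = os.path.join(input_dir, "Classified")
    if not os.path.isdir(classified):
        return None
    if not os.path.isdir(archive_root):
        os.makedirs(archive_root)
    base = os.path.join(archive_root, "Classified-" + time.strftime("%Y%m%d-%H%M%S"))
    destination = base
    counter = 1
    # Avoid clobbering an archive created within the same second.
    while os.path.exists(destination):
        destination = "%s-%d" % (base, counter)
        counter += 1
    shutil.move(classified, destination)
    return destination
```

Calling `archive_classified("path/to/input", "path/to/archive")` after each run keeps earlier results out of the way of the next run.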
The format of the output file is the following:
- Identifier_1 = as in the input
- Identifier_2 = as in the input
- Class probabilities = for algorithms that can output them, such as Logistic Regression, the probabilities for classes 0 and 1
- The rule that decided the output (“ML” stands for machine learning)
- The inferred category (0 or 1)
Example of the previous file classified:
0@#@0@#@0.04-0.96@#@ML@#@1
1@#@1@#@0.89-0.11@#@ML@#@0
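A classified line can be read back in the same way as an input line. This sketch assumes the five-field layout shown in the example, with the probabilities written as two plain decimals joined by a hyphen (class 0 first, class 1 second). The function name `parse_classified_line` is hypothetical, and lines from algorithms that do not output probabilities are not handled.

```python
def parse_classified_line(line):
    """Split one classified line into a dictionary of typed fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("@#@")]
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    id_1, id_2, probabilities, rule, category = fields
    # Assumption: plain decimals such as "0.89-0.11", no scientific notation.
    p0_text, p1_text = probabilities.split("-")
    return {
        "id_1": id_1,
        "id_2": id_2,
        "p0": float(p0_text),
        "p1": float(p1_text),
        "rule": rule,
        "category": int(category),
    }


result = parse_classified_line("1@#@1@#@0.89-0.11@#@ML@#@0\n")
```

Here the second example line is parsed: the model assigns probability 0.89 to class 0, so the translation unit is labeled as incorrect.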
The configuration files are under the directory Parameters. For each modality (Bing, Hunalign, Fastalign) you will find a directory with the configuration files for training (“p-Training-XXX.txt”) and testing (“p-Batch-XXX.txt”). Each parameter is documented in the file itself, on comment lines starting with the symbol “#”.
To see how to run the software in each modality, read the corresponding tutorials in the Documentation directory. These introductory instructions are also available in General Documentation.
The development of this tool has been supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
The author of this tool would like to thank Anna Samiotou from TAUS for testing and feedback.
The author of this tool, Eduard Barbu, can be contacted at: tm.cleaner at yahoo dot com
If you use this software in academic work, please cite this paper:
Eduard Barbu. 2015. Spotting False Translation Segments in Translation Memories. In Proceedings of the Workshop Natural Language Processing for Translation Memories, pages 9–16, Hissar, Bulgaria, September 11. [https://www.aclweb.org/anthology/P/P16/P16-2047.pdf]