-
Notifications
You must be signed in to change notification settings - Fork 0
philip30/qa4mre
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
NAIST-QACLEF ---------------------------------------------------------------------------- Author : Philip Arthur Release Year : 2013 This software is the technical implementation of the following paper: http://www.phontron.com/paper/arthur13clef.pdf In the collaboration with Graham Neubig for the training algorithm (T-MERT) that maximizes the C@1 objective function. http://phontron.com/tmert Note: the file 'tmert.py' is the slight modification of original tmert. The modification is to encapsulate run_mert method so the script can be imported from other python script. License: ---------------------------------------------------------------------------- All rights reserved. This program and the accompanying materials are made available under the terms of the GNU Lesser General Public License (LGPL) version 2.1 which accompanies this distribution, and is available at http://www.gnu.org/licenses/lgpl-2.1.html This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. Release LOG --------------------------------------------------------------------------- v1.1.0 - Add POS TAG - Add Word Splitter - Add Matrix view to Report - Change from Tupple based to List based - Change Porter Stemmer to Word Net Lemmatizer - Save Preprocessed Data separately - Readjust l from 1.0 to 0.5 v1.0.0 - Initial Release Installation: ---------------------------------------------------------------------------- [ FUNDAMENTAL ] - Python 2.7 The software is mainly written in python. So you need to install python first. Currently it supports only python 2.7.x. http://www.python.org/download/releases/2.7/ - Java Java is used in some part where program needs to connect with stanford's jar. Also include java in your environment variable "PATH". The java script in lib path was compiled with javac 1.6.0_27 http://www.oracle.com/technetwork/java/javase/overview/index.html [ SPECIALIZED ] - NLTK NLTK is and stands for natural language toolkit. This software is used for preprocessing. http://nltk.org/install.html - Stanford NER Stanford Named Entity Recognition is used as a part of preprocessing. Please download the jar http://nlp.stanford.edu/software/CRF-NER.shtml, EXTRACT and PUT the name of its path in configuration.py. The default configuration is '/usr/share/stanford-ner/stanford-ner.jar' Please change it according to your machine. The used tagset is english.conll.4class.distsim.crf.ser.gz [it bounds with correference resolution method in preprocessing.py] - Stanford Parser Stanford Parser is used as part of preprocessing to split the input sentence. http://nlp.stanford.edu/software/lex-parser.shtml same with StanfordNER please EXTRACT and PUT the name of its path in configuration.py. Technical Usage Questions ------------------------------------------------------------------------------ 1. What is the main script? - The main script is 'qaclef.py' 2. I want to work with QA4MRE GS 2011 data only [train and test with the data] how can I? - By executing 'python qaclef.py --selftest --data 2011' 3. I want to work with multiple data and test on them, how can I? - By executing 'python qaclef.py --selftest --data 2011 2012 2013' *example when working with 2011, 2012, 2013* 4. I want to train the system with 2011 and test it on 2012, how can I? - By executing 'python qaclef.py --data 2011 --test 2012 --train' 5. I want to change the parameter (e.g. the ngram or the threshold), how can I - adding parameter --ngram=[some_positive_integer] and --threshold=[some_float] would do. Some Technical Details ------------------------------------------------------------------------------- 1. This software runs good in at least 4GB memory. 2. The software basically stores its output on file system. You can find the detail of the output in 'cache' folder. 3. The software is equipped with 'by-passing' feature that will by pass some process, if it founds the saved output in 'cache' folder (according to its process). - Preprocess -> will look on '[edition]-preprocessed.txt' .) It will take the particular edition data on corresponding file. .) To force preprocessing add '--preprocess' flag .) Run software with '--preprocessonly' to do preprocessing only. - TMERT Training -> will look on 'weight.txt' .) To force training add '--train' flag Cache ------------------------------------------------------------------------------- As it is mentioned above, the software will write its process in file system. However not all of them are used to by pass processes. Some of them are intended for debugging and analyzing. They are: - 1 - 5 [preprocessing-details] -> for preprocessing - final.txt -> final test model (written in json) - qa-mert_input.txt -> generated input for tmert.py. If you wish to run tmert seperately, this is the generated train data input for latest run Report ------------------------------------------------------------------------------- The system can generate human readable document by adding '--report' for the execution. The report is written to 'report/report.txt' FAQ ------------------------------------------------------------------------------- 1. How can I add feature to the system? - Basically we add details of the algorithm in metric.py and just map it on scoring.py 2. How can I refine the preprocessing of the system? - By modifying preprocessing.py 3. How can I modify the model of the system. - Please take a look at model_builder.py Contact ----------------------------- Philip Arthur: philip.arthur.om0@is.naist.jp Graham Neubig: neubig@is.naist.jp
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published