GitHub - MSheshera/meta_headers_ffn: Impliments a simple feed forward network to extract metadata from research papers.

Description:

Impliments a simple feed forward neural network for classifying tokens in a research paper as being one of six types of metadata information. Other models which were also part of the project are here:

Mollys BiLSTM,

Akul and Adityas Logistic Regression baseline

Dataset:

GROTOAP2: http://www.dlib.org/dlib/november14/tkaczyk/11tkaczyk.html

Description of directories and files:

feat_extract/*.py: Code to extract features from the data stored as TrueViz xml files and serialize them to npy binary files. Extraction uses on Molly McMohans xml parsing code; also, extraction code is based on her feature extraction code for the BiLSTM which tries to solve the same problem.

utils/*.py: General IO utilities.

/feat_extract/*.py: Mollys xml parsing code; code to extract features from parsed xml and write to gigantic numpy matrices.

settings.py: Contains global settings.

pre_proc.py: Code to standardize extracted features.

learning.py: Code to learn the model. Model specified here.

evaluate.py: Code to evaluate the trained model(s).

driver.py: Runs pipeline and prints results to STDOUT.

Running this code:

The following things need to happen:

Create the simstring databases for use during feature extraction. Only need to do this one. (Run: stringmatching.py)
Go over the xml files; extract features and write to a gigantic numpy feature matrix; this is the slowest step. Do this individually first for the train set and then the test set. (Run with appropriate argumets: grotoap_to_npy.py -h)
Train and test a model etc. (Run with appropriate argumets: driver.py -h)

You mainly need to set output and input paths manually. Most other things happen automatically. All output directories get created if they dont exist (exceptions mentioned below).

In fe_settings.py: Set the following variables:
- lexicon_dir: to point to the directory with all the lexicon text files.
- simstringdb_dir: Path to the directory where you want the output simstring databases.
- grotoap_dir_train: Top level GROTOAP train split directory. Should contain the numbered directories with the .cxml files in them. For example: grotoap_dir_train/00/*.cxml grotoap_dir_train/01/*.cxml etc.
- grotoap_dir_test: Top level GROTOAP test split directory. Should contain the numbered directories with the .cxml files in them. For example: grotoap_dir_test/00/*.cxml grotoap_dir_test/01/*.cxml etc.
- out_dir: Directory to write extracted feature matrices (npy files) into.
- map_dir: Directories to which maps (what int value represents what label etc) should be written.
In settings.py: Set the following variables:
- l_train_path: Full name to train npy file. Extracted in feature extraction run.
- l_dev_path: Full name to dev npy file. Extracted in feature extraction run.
- l_test_path: Full name to test npy file. Extracted in feature extraction run.
- l_map_path: Directories from which to read maps for the labels.
- l_check_dir: Chcekpoint directory. Directory where models should be saved. Direcory not automatically created if doesnt exist.

Example run:

# Create simstring databases.
$ python stringmatching.py
# Create train dataset and label int maps.
$ python grotoap_to_npy.py -dataset train
# Use int maps from above and create test set.
$ python grotoap_to_npy.py -dataset test
# Transform features to be standardized; train default model on local machine; evaluate model.
$ python driver.py -t --machine local --action def_model
# Transform features to be standardized; train tuned model on local machine; save the model.
$ python driver.py -t --machine local --action tuned_model
# Transform features to be standardized; load saved model on local machine; evaluate on test data.
$ python driver.py -t --machine local --action named_model

Note:

All of this was run on the Swarm2 cluster at UMass CICS and has a tonne of things hard coded to run there. Needs a tonne more work to be easy to run elsewhere.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat_extract

feat_extract

utils

utils

.gitignore

.gitignore

README.md

README.md

driver.py

driver.py

evaluate.py

evaluate.py

learning.py

learning.py

pre_proc.py

pre_proc.py

settings.py

settings.py

Repository files navigation

Description:

Dataset:

Description of directories and files:

Running this code:

Note:

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
feat_extract		feat_extract
utils		utils
.gitignore		.gitignore
README.md		README.md
driver.py		driver.py
evaluate.py		evaluate.py
learning.py		learning.py
pre_proc.py		pre_proc.py
settings.py		settings.py

MSheshera/meta_headers_ffn

Folders and files

Latest commit

History

Repository files navigation

Description:

Dataset:

Description of directories and files:

Running this code:

Note:

About

Resources

Stars

Watchers

Forks

Languages