GitHub - MaximumEntropy/UPSITE

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
TEES @ a269f53		TEES @ a269f53
UPTEES		UPTEES
UPTEESoutput		UPTEESoutput
bionlp-1.3		bionlp-1.3
output		output
.gitmodules		.gitmodules
INSTALL.sh		INSTALL.sh
README		README
Repository files navigation

UPSITE
===============================

UPSITE is a large scale bioNLP classification system created by researchers from the University of Pittsburgh and Carnegie Mellon University. At the time of writing this documentation, the corresponding publication had been accepted to GLBio2015 conference and IEEE Transactions on Computational Biology and bioinformatics under the title “Text Mining for Validating Protein Interactions”. An exhaustive description of the system can be found within that paper. UPSITE has gained many functions over the course of its development and therefore can be used for a wide variety of ML related entity classification problems. The default and primary use of UPSITE is its ability to automatically synthesize and collate large portions of the ~24million document PubMed corpus and automatically classify entity-entity interactions (primarily PPIs). 
 
UPSITE was designed using only the highest performing modules available to the BioNLP community in an effort to minimize pipeline error propagation. This is great for optimizing performance, but makes its installation quite difficult. For this reason, I highly recommend accessing the publicly available Amazon Machine Instance (AMI) at the following URL: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;search=UPSITE;sort=name 
The AMI contains a fully pre-configured Ubuntu 14.04 environment and running version of UPSITE. 

We highly recommend that you run UPSITE using this AMI since all TEES models, corpora and NLP tools are already configured for you. If however you need to run UPSITE on your local skip the following instructions and proceed to the section that describes how to setup UPSITE on your local machine.

The following are the steps to use the AMI to run UPSITE

Step 1: Create an Amazon EC2 Instance with the UPSITE AMI

Note: These instructions are written for Ubuntu 14.04.2 but should work for most popular linux distributions as well as the mac operating system. Windows users will find this document helpful but will have to make minor adjustments.

1. Create an Amazon Web Services Account at https://aws.amazon.com

2. Navigate to the AWS console > EC2 > IMAGES > AMIs

3. In the drop down menu located in the search bar, select Public images

4. Type “UPSITE” into the search bar and press enter

◦ The original UPSITE file can be identified by AMI ID:ami-2f5a9144 and owner:503952387608. Alternatively, you may find the AMI at this web address: 

https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;search=upsite;sort=name 

5. When the AMI for UPSITE is located, make sure it is selected and click the Launch button. You will be redirected to the EC2 configuration tutorial. Follow steps 1-7. Below are the recommended settings:

1. AMI: UPSITE

2. Instance Type: Use default settings (t2.micro)

3. Instance Configuration: Use default settings

4. Storage: 30GB

5. Instance Tag: key:myUPSITE Value:UPSITE1

6. Security Group: Type: SSH, Source:My IP 

6. When you are finished setting up your EC2 instance, press Launch

7. You will be instructed to assign a key pair. UPSITE is a Linux AMI. For Linus AMIs, the 

private key file allows you to securely SSH into your UPSITE instance from your local 

computer. Select “Create a new key pair” and name it myUPSITEkey. 

8. Select “Download Key Pair” and save the file (myUPSITEkey.pem) to your computer.

9. To verify your UPSITE instance is now launched, navigate to the EC2 dashboard and 

click on Instances > Instances. You should see UPSITE listed with Instance State:”running”. Note the public IP address listed for your instance on this screen, you will need it later. It should be something like: “54.86.228.154”

Step 2: SSH into your remote UPSITE EC2 instance

1. Open a command terminal on your local system and navigate to the folder containing myUPSITEkey.pem

2. Enter the following commands to SSH into your UPSITE instance:

◦ chmod 400 myUPSITEkey.pem 

◦ ssh -i myUPSITEkey.pem ubuntu@”Public_IP_Address_From_Step_9” 

▪ ex. ssh -i myUPSITEkey.pem ubuntu@54.86.228.154

3. You will then be presented with a welcome screen displaying information pertaining to 

your remote UPSITE linux instance. Congratulations, you have successfully gained access to a fully working version of UPSITE. 

If you configured UPSITE using the AMI, you can skip the subsequent instructions and proceed to the end of this README that describes the command line usage of this tool.

Running UPSITE on your local machine:

In the event that you need to run TEES on your local Ubuntu machine,

clone this repository with the following command

git clone --recursive https://github.com/MaximumEntropy/UPSITE.git

Run the installation script INSTALL.sh as follows:

1. chmod +x INSTALL.sh

2. ./INSTALL.sh

This script fetches the necessary dependencies required to run UPSITE and TEES and also configures the TEES event extraction system. 

Once the dependencies are installed, TEES can now be configured.

The TEES configuration is an interactive command line system to download related datasets, models and NLP tools required to run the system.

You should see a prompt that looks like 

[X] 1) Install classifier (SVM Multiclass)
[X] 2) Install models (TEES models for BioNLP'09-13 and DDI'11-13)
[X] 3) Install corpora (BioNLP'09-13 and DDI'11-13)
[X] 4) Install preprocessing tools (BANNER, BLLIP parser etc)
 *  c) Continue and install selected items
    q) Quit

* indicates the default option at , press c to continue

============================== Install Directory ==============================
1. By default, all data and tools will be installed to one directory, the
DATAPATH. You can later set the installation directory individually for each
component, or you can change the default path now.

2. TEES reads its configuration from a file defined by the environment variable
"TEES_SETTINGS". This environment variable must be set, and point to a
configuration file for TEES to work. By editing this configuration file you can
configure TEES in addition (or instead of) using this configuration program.

The "TEES_SETTINGS" environment variable is not set, but a configuration file
has been found in the default location. This installation program will use the
existing file, and by default install only missing components.
--------------------------------------------------------------------------------
    1) Change DATAPATH (/home/ubuntu/.tees)
    2) Change TEES_SETTINGS (/home/ubuntu/.tees_local_settings.py)
 *  c) Continue
================================================================================

press c to continue again. This step allows you to change the path where your TEES settings and data are stored. We recommend sticking to the default settings.

================================== Classifier ==================================
TEES uses the SVM Multiclass classifer by Thorsten Joachims for all
classification tasks. You can optionally choose to compile it from source if the
precompiled Linux-binary does not work on your system. The SVM_MULTICLASS_DIR
setting is already configured, so the default option is to skip installing.

SVM_MULTICLASS_DIR=/home/ubuntu/.tees/tools/SVMMultiClass
--------------------------------------------------------------------------------
[ ] 1) Compile from source
 *  i) Install
    s) Skip
================================================================================

This installs LibSVM for classification. Press i install.

==================================== Models ====================================
TEES models are used for predicting events or relations using classify.py.
Models are provided for all tasks in the BioNLP'11, BioNLP'09 and DDI'11 shared
tasks, for all BioNLP'13 tasks except BB task 1, and for task 9.2 of the DDI'13
shared task.

For a list of models and instructions for using them see
https://github.com/jbjorne/TEES/wiki/Classifying.
--------------------------------------------------------------------------------
[ ] 1) Redownload already downloaded files

 *  i) Install
    s) Skip
================================================================================

TEES uses various event models for its information extraction. Press i to install them.

=================================== Corpora ===================================
The corpora are used for training new models and testing existing models. The
corpora installable here are from the three BioNLP Shared Tasks (2009, 2011 and
2013) on Event Extraction (organized by University of Tokyo), and the two Drug-
Drug Interaction  Extraction tasks (DDI'11 and 13, organized by Universidad
Carlos III de Madrid).

The 2009 and 2011 corpora are downloaded as interaction XML files, generated
from the original Shared Task files. If you need to convert the corpora from
the original files, you can use the convertBioNLP.py, convertDDI.py and
convertDDI13.py programs located at Utils/Convert.

The 2013 corpora will be converted to interaction XML from the official corpus
files, downloaded automatically from the task websites. Installing the BioNLP'13
corpora will take about 10 minutes.

It is also recommended to download the official BioNLP Shared Task evaluator
programs, which will be used by TEES when training or testing on those corpora.
--------------------------------------------------------------------------------
[ ] 1) Redownload already downloaded files

[X] 2) Install BioNLP'11 corpora
[x] 3) Install BioNLP'09 (GENIA) corpus
[x] 4) Install DDI'11 (Drug-Drug Interactions) corpus

[X] 5) Install BioNLP'13 corpora
[x] 6) Install DDI'13 (Drug-Drug Interactions) corpus

[x] 7) Install BioNLP evaluators

  * i) Install
    s) Skip
================================================================================ 

TEES trains its algorithms on various biomedical coropora that can be downloaded in this step. We recommend downloading all corpora for best performance.

==================================== Tools ====================================
The tools are required for processing unannotated text and can be used as part
of TEES, or independently through their wrappers. For information and usage
conditions, see https://github.com/jbjorne/TEES/wiki/Licenses. Some of the tools
need to be compiled from source, this will take a while.

The external tools used by TEES are:

The GENIA Sentence Splitter of Tokyo University (Tsuruoka Y. et. al.)

The BANNER named entity recognizer by Robert Leaman et. al.

The BLLIP parser of Brown University (Charniak E., Johnson M. et. al.)

The Stanford Parser of the Stanford Natural Language Processing Group The
GENIA_SENTENCE_SPLITTER_DIR setting is already configured, so the default option
is to skip installing.

GENIA_SENTENCE_SPLITTER_DIR=/home/ubuntu/.tees/tools/geniassThe BANNER_DIR
setting is already configured, so the default option is to skip installing.

BANNER_DIR=/home/ubuntu/.tees/tools/BANNERThe BLLIP_PARSER_DIR setting is
already configured, so the default option is to skip installing.

BLLIP_PARSER_DIR=/home/ubuntu/.tees/tools/BLLIP/dmcc-bllip-parser-cb43c6cThe
STANFORD_PARSER_DIR setting is already configured, so the default option is to
skip installing.

STANFORD_PARSER_DIR=/home/ubuntu/.tees/tools/stanford-parser-2012-03-09
--------------------------------------------------------------------------------
[x] 1) Redownload already downloaded files

[x] 2) Install GENIA Sentence Splitter
[x] 3) Install BANNER named entity recognizer
[x] 4) Install BLLIP parser
[x] 5) Install Stanford Parser

  * i) Install
    s) Skip
================================================================================

This step installs all the NLP tools. Once again, we recommend that you install all the NLP tools.

WARNING: This step involves the download of several resources not developed by us. In the even that one of the models, corpora or NLP tools are unable to download or install, please try again after a while and if the problem persists, please use our .ami file to run our system on Amazon EC2. 

A full list of dependencies required to run this system are listed below:

1. Ruby (sudo apt-get install ruby)
2. Make (sudo apt-get install make)
3. g++ (sudo apt-get install g++-multilib)
4. flex (sudo apt-get install flex)
5. boost (sudo apt-get install libboost-all-dev)
6. Java JRE (sudo apt-get install default-jre) 
7. Java JDK (sudo apt-get install default-jdk)
8. Numpy (sudo apt-get install python-numpy)
9. Scipy (sudo apt-get install python-scipy)
10. Sklearn (sudo apt-get install python-sklearn)
11. NLTK (sudo apt-get install python-nltk)

Example usage of UPSITE: 

The system is intended to be used a command line tool. 

An example usage of this system is:

sudo python UPTEES/UPSITE.py -q MDM2 -w TERT -i single -n 60 -o mdm2_tert.tsv

This will run experiments using the REL11, EPI11 and ID11 TEES models, to run it on a specific model,

sudo python UPTEES/UPSITE.py -q MDM2 -w TERT -i single -n 60 -m REL11 -o mdm2_tert.tsv

For a detailed description of the command line arguments, run the following command 

python UPTEES/UPSITE.py -help

The runtime of this system is heavily dependent on the number of papers parsed. For best performance, we recommend setting n to atleast 60 papers to replicate the results in our paper. This entails several hours of runtime.

UPSITE was written by Adam G. Roth. The author can be contacted for questions or collorborations at Roth.AdamG@gmail.com

Disclaimer: The Turku Event Extraction System (TEES) although distributed as a submodule with this repository, is not developed by us.