relna - Biomedical Text Mining for Relation Extraction

Relna is a text mining (TM) tool for extracting relations between transcription factors and genes / gene products. To the best of our knowledge, it is the first text mining tool for relation extraction of transcription factors and associated proteins. It was developed as part of a thesis at the Technical University of Munich and is built on the nalaf framework, which was itself developed as part of two other theses at the same university. The tool is generic enough that others can extend it with their own modules, e.g. parsers, features, and taggers. The method uses Support Vector Machines and allows the use of tree kernels.

The nalaf framework is well documented here.

As part of the thesis, an associated corpus of the same name (relna) was annotated using tagtog. The relna corpus consists of 140 documents that were semi-automatically annotated for named entities using GNormPlus and manually annotated for relations. The motivation for extracting relations between transcription factors and genes / gene products, along with the corpus statistics, is documented here.

Using our method, we achieve an F-measure of 69.3% on the relna corpus. The full results of our experiments are available here.

Brief Results

The pipeline used by relna is as follows:

Pipeline diagram

Install

Requirements

  • Python 3
  • SVMLight, linear vs tree kernel:
    • The default is to use SVMLight with linear kernels, already defined in https://github.com/Rostlab/nalaf.
    • If using SVMLight TK for tree kernels:
      • BLLIP Parser
      • SVMLight-TK-1.2
        • The easiest way to install it is to download compiled binaries from the official website.
        • You will have to fill out a form to get it, and then build it using the given Makefile.
        • Place the binaries svm_classify and svm_learn in your $PATH (note that, as of now, this is also needed by nalaf for SVMLight).
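
Before running anything, it can help to confirm that both binaries are actually visible on the PATH. The following is a minimal sketch, not part of relna or nalaf: the helper name `find_missing_binaries` is hypothetical, and only the binary names `svm_learn` and `svm_classify` come from the requirement above.

```python
import shutil

# Binary names taken from the requirement above.
REQUIRED_BINARIES = ("svm_learn", "svm_classify")


def find_missing_binaries(names=REQUIRED_BINARIES):
    """Return the names that cannot be resolved on the current PATH."""
    # shutil.which returns None when no executable with that name is found.
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_binaries()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All SVMLight binaries found on PATH.")
```

If a binary is reported as missing, add the directory that contains it to `$PATH`, or pass that directory to relna.py with the `-c` option shown in the Examples section.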

Install Code

  • Installation of nalaf
git clone https://github.com/Rostlab/nalaf
cd nalaf
python3 setup.py install
python3 -m nalaf.download_corpora
  • Installation of relna
git clone https://github.com/Rostlab/relna.git
cd relna
python3 setup.py install
python3 -m relna.download_corpora

Eventually, once the package is registered on PyPI, you will be able to install relna with:

pip3 install relna

Examples

Run:

  • relna.py for a simple example of how to use relna just for prediction with a pre-trained model
    • python3 relna.py -c [PATH SVMLight BIN DIR] -p 10383460
    • python3 relna.py -c [PATH SVMLight BIN DIR] -s "Conclusion: we find that Ubc9 interacts with the androgen receptor (AR), a member of the steroid receptor family of ligand-activated transcription factors. In transiently transfected COS-1 cells, AR-dependent but not basal transcription is enhanced by the coexpression of Ubc9."
    • python3 relna.py -c [PATH SVMLight BIN DIR] -d example.txt
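
When calling relna.py from another script, passing the arguments as a list avoids shell-quoting problems with the long sentence given to `-s`, which contains parentheses and commas. The sketch below only assembles the command lines shown above. The helper name `build_relna_command` and the directory `/opt/svmlight/bin` are hypothetical placeholders, and the flags `-c`, `-p`, `-s` and `-d` are used exactly as they appear in the examples, without assuming anything further about them.

```python
import shlex

# Flags that appear in the examples above; nothing else is assumed to exist.
KNOWN_FLAGS = ("-p", "-s", "-d")


def build_relna_command(svmlight_dir, flag, value):
    """Build the argument list for one of the relna.py invocations above."""
    if flag not in KNOWN_FLAGS:
        raise ValueError(f"unsupported flag: {flag}")
    return ["python3", "relna.py", "-c", svmlight_dir, flag, value]


if __name__ == "__main__":
    # Hypothetical location of the SVMLight binaries.
    svmlight_dir = "/opt/svmlight/bin"
    sentence = (
        "Conclusion: we find that Ubc9 interacts with the androgen "
        "receptor (AR), a member of the steroid receptor family of "
        "ligand-activated transcription factors."
    )
    for flag, value in (("-p", "10383460"), ("-s", sentence), ("-d", "example.txt")):
        command = build_relna_command(svmlight_dir, flag, value)
        # shlex.join (Python 3.8+) shows the safely quoted shell equivalent.
        print(shlex.join(command))
        # To actually run it from the relna checkout:
        # subprocess.run(command, check=True)
```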

Future Work

Important:

  • Implement neural networks (Theano or TensorFlow, once they are released for Python 3) for training and classification, and evaluate their performance.
  • Implement bootstrapping for relation extraction (similar to nalaf, where it has been done for entities)
  • Implement multiple sentence models, looking at relations at a distance of one sentence and beyond

Not-So-Important:

  • Implement coreference resolution (might increase performance slightly)
  • Experiment with tree kernels (SVMLight TK), which achieve a very high precision (P>91), to extract highly accurate relations from the whole of PubMed. In the end, that may give better extraction results, since the lower recall (R~21) is compensated by the sheer size of PubMed.
  • spaCy plans to implement its own constituency parser. Once it is available, replace BLLIP with spaCy for speed and efficiency (no linking to external C/C++ libraries).
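
To make the precision/recall trade-off of the tree-kernel item concrete, the sketch below computes the F-measure for that operating point. It rests on two assumptions not stated in this README: that "F-measure" means the balanced F1 score, and that the rounded figures 91 and 21 can stand in for "P>91" and "R~21". The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Rounded stand-ins for the tree-kernel figures quoted above (assumption).
    print(f_measure(91.0, 21.0))  # 34.125
```

Under these assumptions the tree-kernel setting scores far below the 69.3% reported for the default method, which is why it is proposed only for high-precision extraction over a very large collection, where low recall matters less.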

About

Biomedical Relation Extraction for Transcription Factor and Gene / Gene Products (part of a Master Thesis at Rostlab, TUM)
