SimSem

Introduction

SimSem is a tool for semantic category disambiguation using approximate string matching. It is distributed under the ISC License. To accomplish this, SimSem uses large collections of strings (such as dictionaries), LibLinear as its machine-learning component, and SimString for fast approximate string matching. Please see the publication cited below for details.

If you draw inspiration from or base your work on SimSem, please cite the following publication, given here in BibTeX format:

@InProceedings{stenetorp2011simsem,
  author    = {Stenetorp, Pontus and Pyysalo, Sampo and Tsujii, Jun'ichi},
  title     = {SimSem: Fast Approximate String Matching 
      in Relation to Semantic Category Disambiguation},
  booktitle = {Proceedings of BioNLP 2011 Workshop},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {136--145},
  url       = {http://www.aclweb.org/anthology/W11-0218}
}

Building

Clone this repository using git clone, then run the preparation script, which downloads the lexical resources, creates the databases, performs some (admittedly ugly) code generation, and builds the external dependencies:

./prepare.sh

Running

Currently there are two end-user tools that use the Internal-SimString model: train.py and classify.py. Train a model using train.py with training data in the following format, one tab-separated entry per line:

${STRING}\t${TYPE}
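
As an illustration of the training format above, here is a minimal Python sketch that writes (string, type) pairs as tab-separated lines. The helper name write_training_data, the example pairs, and the output path are hypothetical and not part of SimSem. How the resulting file is passed to train.py is not covered here.

```python
from typing import Iterable, Tuple


def write_training_data(pairs: Iterable[Tuple[str, str]], path: str) -> int:
    """Write (string, type) pairs as '<string>TAB<type>' lines; return the line count.

    Hypothetical helper for illustration only; it is not part of SimSem.
    """
    count = 0
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for string, type_ in pairs:
            # A tab or newline inside a field would break the one-entry-per-line format.
            for field in (string, type_):
                if "\t" in field or "\n" in field:
                    raise ValueError(f"field contains a tab or newline: {field!r}")
            handle.write(f"{string}\t{type_}\n")
            count += 1
    return count


if __name__ == "__main__":
    # Example pairs are made up for illustration.
    examples = [("NF-kB", "Protein_complex"), ("p53", "Protein")]
    write_training_data(examples, "train.tsv")
```

Rejecting embedded tabs and newlines up front keeps every output line at exactly two fields, which is all the format above describes.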

You can use classify.py as follows:

echo 'NF-kB' | ./classify.py ${MODEL_PATH}

You will get tab-separated output in the form:

NF-kB   [('Protein_complex', 0.9599138507870001), ... ]
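
If you want to post-process this output, the sketch below shows one way to parse a line. It assumes, based only on the example above, that the first field is the input string and the second field is a Python-literal list of (type, score) tuples. The function names are hypothetical, and the sketch does not assume the list is sorted by score.

```python
import ast
from typing import List, Tuple


def parse_classification_line(line: str) -> Tuple[str, List[Tuple[str, float]]]:
    """Split one output line into the input string and its (type, score) list.

    Assumes '<string>TAB<python list literal>'; hypothetical helper, not part of SimSem.
    """
    text, sep, raw = line.rstrip("\n").partition("\t")
    if not sep:
        raise ValueError(f"expected a tab-separated line, got: {line!r}")
    # literal_eval only evaluates Python literals, so it is safe for untrusted text.
    predictions = ast.literal_eval(raw)
    return text, [(str(label), float(score)) for label, score in predictions]


def best_type(line: str) -> str:
    """Return the highest-scoring type on the line."""
    _, predictions = parse_classification_line(line)
    return max(predictions, key=lambda item: item[1])[0]


if __name__ == "__main__":
    # The second tuple is made up for illustration.
    sample = "NF-kB\t[('Protein_complex', 0.9599138507870001), ('Protein', 0.03)]"
    print(parse_classification_line(sample))
    print(best_type(sample))
```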

Experiments

Experiments are run using test.py; use the -h flag for more information. For example, to replicate the main experiment (and plots) from the BioNLP 2011 publication (use the tag bionlp_2011), you would run:

mkdir bionlp_2011
./test.py -v -c INTERNAL -c INTERNAL-SIMSTRING -c INTERNAL-GAZETTER \
    -d BioNLP-ST-2011-Epi_and_PTM -d BioNLP-ST-2011-Infectious_Diseases \
    -d BioNLP-ST-2011-genia -d CALBC_II -d NLPBA -d SUPER_GREC \
    bionlp_2011 learning
./test.py bionlp_2011 plot

Resources

SimSem uses a large collection of lexical resources; the conversion and processing scripts for these resources can be found under data/simstring/res/. Since the resources are rather large, they are distributed separately and can be downloaded here (mirror).

There are also data sets available in the BioNLP 2009 Shared Task format, two of which have been converted from other formats. All data sets have also been sentence-split and tokenised in accordance with the BioNLP Shared Task 2011 pipeline. These resources are further described on the project wiki and can be found under data/corpora. You may also be interested in prepare.sh, which pre-processes and corrects some aspects of the data.
