SimSem is a tool for semantic disambiguation using approximate string matching, distributed under the ISC License. To accomplish this, SimSem uses large collections of strings such as dictionaries, LibLinear as its machine-learning component, and SimString for fast approximate string matching. Please see the publication cited below for details.
If you draw inspiration from or base your work on SimSem, please cite the following publication, provided here in BibTeX format:
@InProceedings{stenetorp2011simsem,
author = {Stenetorp, Pontus and Pyysalo, Sampo and Tsujii, Jun'ichi},
title = {SimSem: Fast Approximate String Matching
in Relation to Semantic Category Disambiguation},
booktitle = {Proceedings of BioNLP 2011 Workshop},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {136--145},
url = {http://www.aclweb.org/anthology/W11-0218}
}
Clone this repository using git clone, then run the preparation script to
download the lexical resources, create the databases, generate some code
(ugly) and build the external dependencies:
./prepare.sh
Currently there are two end-user tools that use the Internal-SimString
model: train.py and classify.py. Train a model using train.py with training
data in the following tab-separated format:
${STRING}\t${TYPE}
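As an illustration of this format, the following is a minimal Python 3 sketch that writes string and type pairs to a tab-separated file. The helper names format_training_lines and write_training_data are hypothetical and not part of SimSem. The sketch also assumes one pair per line, which the format above suggests but does not state.

```python
def format_training_lines(pairs):
    """Yield one STRING<TAB>TYPE line per (string, type) pair."""
    for string, type_ in pairs:
        # A tab or newline inside a field would corrupt the two-column layout.
        for field in (string, type_):
            if '\t' in field or '\n' in field:
                raise ValueError('field contains a tab or newline: %r' % field)
        yield '%s\t%s\n' % (string, type_)


def write_training_data(pairs, path):
    """Write the pairs to path as UTF-8, one pair per line (assumed layout)."""
    with open(path, 'w', encoding='utf-8') as out:
        for line in format_training_lines(pairs):
            out.write(line)
```

For example, write_training_data([('NF-kB', 'Protein_complex')], 'train.tsv') would produce a one-line file. Both the pair and the file name are chosen here purely for illustration.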
You can use classify.py as follows:
echo 'NF-kB' | ./classify.py ${MODEL_PATH}
You will get tab-separated output of the form:
NF-kB [('Protein_complex', 0.9599138507870001), ... ]
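To consume this output from another program, a line can be split on the tab and the second column read back as a Python literal. This is only a sketch. It assumes that the second column is the printed form of a Python list of (type, score) tuples, as the example above suggests, and that the real output prints the list in full rather than with the elision shown in this document. The function names parse_classify_line and best_type are hypothetical and not part of SimSem.

```python
import ast


def parse_classify_line(line):
    """Split one output line into the input string and its (type, score) list."""
    # The input string and the predictions are separated by a single tab.
    string, predictions = line.rstrip('\n').split('\t', 1)
    # literal_eval only evaluates Python literals; it does not execute code.
    scored = ast.literal_eval(predictions)
    return string, [(str(type_), float(score)) for type_, score in scored]


def best_type(line):
    """Return the highest-scoring type for a line, or None if the list is empty."""
    _, scored = parse_classify_line(line)
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[1])[0]
```

Taking the maximum explicitly avoids relying on the list being sorted by score, which the example output does not guarantee.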
Experiments are run using test.py; use the -h flag for more
information. For example, to replicate the main experiment (and plots) from
the BioNLP 2011 publication (use the tag bionlp_2011), you would run:
mkdir bionlp_2011
./test.py -v -c INTERNAL -c INTERNAL-SIMSTRING -c INTERNAL-GAZETTER \
-d BioNLP-ST-2011-Epi_and_PTM -d BioNLP-ST-2011-Infectious_Diseases \
-d BioNLP-ST-2011-genia -d CALBC_II -d NLPBA -d SUPER_GREC \
bionlp_2011 learning
./test.py bionlp_2011 plot
SimSem uses a large collection of lexical resources; the conversion and
processing scripts for these resources can be found under
data/simstring/res/. Since the resources are rather large, they are
distributed separately and can be downloaded here (mirror).
There are also data sets available in the BioNLP 2009 Shared Task format,
two of which have been converted from other formats. All data sets have also
been sentence split and tokenised in accordance with the BioNLP Shared Task
2011 pipeline. These resources can be found under data/corpora and are
further described on the project wiki. You may also be interested in
prepare.sh, which pre-processes and corrects some aspects of the data.