Pymm: Python Wrapper for MetaMap

Python wrapper for extracting candidate and mapping concepts using MetaMap. Pymm parses MetaMap's XML output. The following concept information is extracted:

  • score
  • matched word
  • cui
  • semtypes
  • negated
  • matched word start position
  • matched word end position
  • ismapping

The flag ismapping is set to True for a mapping concept and to False for a candidate concept.
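
To make the field list concrete, here is a minimal sketch of a record holding these eight fields. The class name ConceptRecord and the attribute names negated, start and end are hypothetical and chosen only for illustration; they are not taken from pymm, and the values are made up rather than produced by MetaMap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConceptRecord:
    """Hypothetical stand-in for one extracted concept (not pymm's own class)."""
    score: float
    matched: str
    cui: str
    semtypes: List[str]
    negated: bool
    start: int  # hypothetical name for the matched word start position
    end: int    # hypothetical name for the matched word end position
    ismapping: bool


# Illustrative values only; they were not produced by MetaMap.
record = ConceptRecord(
    score=1000.0, matched="heart attack", cui="C0027051", semtypes=["dsyn"],
    negated=False, start=0, end=12, ismapping=True,
)

# ismapping separates mapping concepts from candidate concepts.
kind = "mapping" if record.ismapping else "candidate"
print(f"{record.cui} ({kind}): {record.matched!r}")
```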

Installation

git clone https://github.com/smujjiga/pymm.git
cd pymm
python setup.py install

Usage

Create a Python MetaMap wrapper object by pointing it to the location of MetaMap

from pymm import Metamap
mm = Metamap(METAMAP_PATH)

We can check whether MetaMap is running using

assert mm.is_alive()

Concept extraction is done via the parse method

mmos = mm.parse(['heart attack', 'myocardial infarction'])

The parse method returns an iterator of MetaMap object iterators, one per input sentence. Each MetaMap object iterator returns the candidate and mapping concepts for its sentence.

for idx, mmo in enumerate(mmos):
    for jdx, concept in enumerate(mmo):
        print(concept.cui, concept.score, concept.matched)
        print(concept.semtypes, concept.ismapping)
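
As a rough illustration of walking this nested structure, the sketch below collects the CUIs of the mapping concepts for each sentence. The helper name mapping_cuis is hypothetical, and the input is stand-in data built with SimpleNamespace so the example runs without MetaMap; only the cui and ismapping attributes shown above are relied on.

```python
from types import SimpleNamespace


def mapping_cuis(mmos):
    """Return, per input sentence, the CUIs of concepts flagged as mappings."""
    result = []
    for mmo in mmos:
        # Each inner iterator is consumed exactly once.
        result.append([c.cui for c in mmo if c.ismapping])
    return result


# Stand-in data shaped like the output of parse: one iterator per sentence.
# The CUIs are placeholders, not real MetaMap results.
fake_mmos = iter([
    iter([SimpleNamespace(cui="C0000001", ismapping=True),
          SimpleNamespace(cui="C0000002", ismapping=False)]),
    iter([SimpleNamespace(cui="C0000003", ismapping=True)]),
])

per_sentence = mapping_cuis(fake_mmos)
print(per_sentence)
```

Since the results are described as iterators, a second pass over the same object should not be expected to yield anything; collect what you need into a list on the first pass.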

The Python MetaMap wrapper object also supports a debug parameter, which persists the input and output files and prints the command line used to run MetaMap

mm = Metamap(METAMAP_PATH, debug=True)

Sample

The code snippet below extracts concepts from a large number of sentences, processing them in batches and recording a checkpoint after each batch.

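import sys
import traceback

# Assumption: MetamapStuck is importable from pymm alongside Metamap
from pymm import MetamapStuck

# mm, CLINICAL_TEXT_FILE, last_checkpoint, BATCH_SIZE, clean_text, save and
# record_checkpoint are expected to be defined by the caller.

def read_lines(file_name, fast_forward_to, batch_size, preprocessing):
    """Yield batches of preprocessed lines, skipping the first fast_forward_to lines."""
    sentences = []
    with open(file_name, 'r') as fp:
        # Skip the lines that were processed before the last checkpoint
        for _ in range(fast_forward_to):
            fp.readline()

        for line in fp:
            sentences.append(preprocessing(line))
            if len(sentences) == batch_size:
                yield sentences
                sentences = []

    # Yield the final partial batch so trailing lines are not dropped
    if sentences:
        yield sentences

try:
    for i, sentences in enumerate(read_lines(CLINICAL_TEXT_FILE, last_checkpoint, BATCH_SIZE, clean_text)):
        timeout = 0.33 * BATCH_SIZE
        try_again = False
        try:
            mmos = mm.parse(sentences, timeout=timeout)
        except MetamapStuck:
            # Retry below with a larger timeout
            print("Metamap stuck; trying with a larger timeout")
            try_again = True
        except Exception:
            print("Exception in mm; skipping the batch")
            traceback.print_exc(file=sys.stdout)
            continue

        if try_again:
            timeout = BATCH_SIZE * 2
            try:
                mmos = mm.parse(sentences, timeout=timeout)
            except MetamapStuck:
                # Stuck again; ignore this batch
                print("Metamap stuck again; ignoring the batch")
                continue
            except Exception:
                print("Exception in mm; skipping the batch")
                traceback.print_exc(file=sys.stdout)
                continue

        # Save every concept together with the sentence it came from
        for idx, mmo in enumerate(mmos):
            for jdx, concept in enumerate(mmo):
                save(sentences[idx], concept)

        # Number of lines consumed so far, used to resume after a restart
        curr_checkpoint = (i + 1) * BATCH_SIZE + last_checkpoint
        record_checkpoint(curr_checkpoint)
finally:
    mm.close()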
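
The retry-with-a-larger-timeout pattern in the sample can also be factored into a small helper. The following is only a sketch: parse_with_retries, FakeStuck and fake_parse are hypothetical names introduced here, and the stand-in parser exists solely so the example runs without MetaMap installed.

```python
def parse_with_retries(parse, sentences, timeouts, stuck_exc):
    """Call parse(sentences, timeout=t) for each t in turn until one succeeds.

    Returns the parse result, or None if every timeout was exhausted.
    """
    for timeout in timeouts:
        try:
            return parse(sentences, timeout=timeout)
        except stuck_exc:
            print(f"Stuck with timeout={timeout}; escalating")
    return None


# Stand-ins so the sketch runs without MetaMap installed.
class FakeStuck(Exception):
    pass


def fake_parse(sentences, timeout):
    # Pretend the parser needs at least 10 seconds to finish.
    if timeout < 10:
        raise FakeStuck()
    return [s.upper() for s in sentences]


batch = ["heart attack", "myocardial infarction"]
result = parse_with_retries(fake_parse, batch, [0.66, 4, 20], FakeStuck)
print(result)
```

With the wrapper from the sample, the call would presumably look like parse_with_retries(mm.parse, sentences, [0.33*BATCH_SIZE, 2*BATCH_SIZE], MetamapStuck), with a None result meaning the batch should be skipped.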

Acknowledgement

This Python wrapper is inspired by https://github.com/AnthonyMRios/pymetamap. Pymetamap parses the MMI output, whereas Pymm parses the XML output. I wrote Pymm to extract concepts from very large corpora, and I have used it to extract candidate and mapping concepts from 10 million sentences.