Distributed Key Phrase Extractor

Author: Alex Bowe

Email: alex@alexbowe.com

Description

A program that deploys a Hadoop MapReduce job to extract noun phrases from text. It uses NLTK to POS-tag the input and chunk it with a grammar (from S. N. Kim et al., "Evaluating n-gram based evaluation metrics for automatic keyphrase extraction"), then ranks the resulting phrases using TF-IDF.
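As a rough illustration of the chunk-then-rank idea, here is a minimal single-machine sketch in plain Python 3. It is not the code in this repository: it skips Hadoop, Dumbo and NLTK, assumes the tokens are already POS-tagged with Penn Treebank tags, and swaps the Kim et al. grammar for a much simpler rule (a run of adjectives and nouns that ends in a noun). The function names `noun_phrases` and `tf_idf`, and the particular TF-IDF weighting, are assumptions made for the example.

```python
import math
from collections import Counter


def noun_phrases(tagged):
    """Yield candidate phrases from a list of (word, POS tag) pairs.

    Simplified stand-in for a chunk grammar: a maximal run of
    adjectives (JJ*) and nouns (NN*) that ends in a noun.
    """
    run = []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag.startswith("NN") or tag.startswith("JJ"):
            run.append((word.lower(), tag))
            continue
        # Drop trailing adjectives so the phrase ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            yield " ".join(w for w, _ in run)
        run = []


def tf_idf(docs):
    """Rank phrases per document.

    `docs` maps a document id to its list of candidate phrases.
    Returns a dict of document id -> [(phrase, score), ...], best first.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for phrases in docs.values():
        doc_freq.update(set(phrases))  # count each phrase once per document

    ranked = {}
    for doc_id, phrases in docs.items():
        term_freq = Counter(phrases)
        scores = {
            p: (count / len(phrases)) * math.log(n_docs / doc_freq[p])
            for p, count in term_freq.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked[doc_id] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked


if __name__ == "__main__":
    tagged_docs = {
        "a": [("The", "DT"), ("suffix", "NN"), ("array", "NN"), ("is", "VBZ"),
              ("a", "DT"), ("compact", "JJ"), ("index", "NN")],
        "b": [("A", "DT"), ("compact", "JJ"), ("index", "NN"),
              ("saves", "VBZ"), ("memory", "NN")],
    }
    candidates = {d: list(noun_phrases(t)) for d, t in tagged_docs.items()}
    for doc_id, phrases in tf_idf(candidates).items():
        print(doc_id, phrases)
```

In this toy input, "compact index" appears in both documents, so its inverse document frequency is zero and it ranks below the phrases unique to each document. In a MapReduce setting the document-frequency count and the per-document scoring would typically be split across separate map and reduce steps.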

Obtaining

To clone this repository:

$ git clone http://github.com/alexbowe/keyphrase.git

This will create a directory keyphrase in your working directory. Note that this won't allow you to submit changes to the master repository.

Running

You must have Hadoop and Dumbo installed. Just type:

./run.sh

This will copy the contents of the text folder to HDFS and format the results to work with our evaluation script.

While debugging, you may want to run Dumbo in local mode:

./run.sh -l

Dependencies

PROVIDED:

NOT PROVIDED:

License

Anyone can use my work however they wish.

NLTK is distributed under the Apache License Version 2.0. PyYAML is distributed under the MIT License.
