
Glossary Term Extraction from the CrowdRE Dataset

This repository contains the code used to extract and analyze glossary term candidates from the CrowdRE dataset for the conference paper proposal by Gemkow, Conzelmann, Hartig, and Vogelsang.

The file "Extended Report.pdf" presents an extended version of our findings that also discusses the effects of varying elements of our processing pipelines on the extracted glossary term candidates.

Both the paper and this code build on the CrowdRE dataset: P. K. Murukannaiah, N. Ajmeri, and M. P. Singh, “The smarthome crowd requirements dataset,” https://crowdre.github.io/murukannaiah-smarthome-requirements-dataset/, Apr. 2017.

Replicating the analysis

  1. Install the prerequisite packages nltk (for language processing) and openpyxl (for Excel export) in the Python environment that you will be using. Note that nltk must be installed together with the WordNet corpus; you may need to execute "nltk.download('wordnet')" in a Python shell after installing the package (see the snippet after these steps).

  2. Download or clone this repository to your local machine.

  3. Download the CSV version of the dataset from the URL mentioned above and place the requirements.csv file in the data folder.

  4. Execute the src/analysis.py file with a Python 3 interpreter.

The results of the analysis will be output to the console.
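For the one-time NLTK setup mentioned in step 1, the following is a minimal sketch, assuming a standard nltk installation; the pipeline itself is then started via src/analysis.py as in step 4:

    # Run once in a Python shell after installing the prerequisite packages
    # (e.g. via "pip install nltk openpyxl").
    import nltk

    # Fetch the WordNet corpus that the processing pipeline relies on.
    nltk.download('wordnet')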

How the analysis works

The main workhorse of the analysis script is the function glossary_extraction in the file pipeline.py. Through its parameters, this function allows each step in the pipeline to be varied along the alternatives described in the paper. In general, for each comparison, analysis.py runs this function twice with different pipeline configurations and then compares the outputs.

NB: We do not focus on the immediate effects of changing a pipeline step (i.e. the words directly removed by a filter), but on the change in the ultimate pipeline output caused by changing that step. This is important since changes in the middle of the pipeline may have important repercussions on later steps (e.g. when omitting stemming, many important concepts are later removed by the statistical filter because each individual word form occurs too rarely in the dataset to fulfill the frequency criterion).
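To illustrate this comparison pattern, here is a sketch of comparing the final outputs of two pipeline configurations; the keyword argument used here (stemming) is hypothetical, and the actual parameters of glossary_extraction are defined in pipeline.py:

    from pipeline import glossary_extraction

    # Run the full pipeline twice with a single step varied (illustrative
    # parameter name only) and compare the resulting term candidate sets.
    with_stemming = set(glossary_extraction(stemming=True))
    without_stemming = set(glossary_extraction(stemming=False))

    print(len(with_stemming - without_stemming), "candidates appear only with stemming")
    print(len(without_stemming - with_stemming), "candidates appear only without stemming")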

Re-using intermediate results

Training a PoS tagger and PoS-tagging the requirements are computationally expensive steps that should not be re-run every time an analysis in the later stages of the pipeline is changed. Therefore, both the tagger itself and the tagged requirements are saved as pickle files in the temp folder. By default, the glossary_extraction function starts from the PoS-tagged requirements. By setting the tag_mode parameter, you can enforce a new PoS-tagging run using the existing tagger, or train and apply a new tagger from scratch. If you want to analyze changes in the very early pipeline steps (e.g. tokenization), this is necessary because the pre-computed tagged requirements will not reflect the consequences of such changes.
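The caching idea behind this is roughly the following sketch; the file name and helper function are hypothetical, and the actual implementation lives in the repository's source files:

    import os
    import pickle

    TAGGER_PATH = os.path.join("temp", "tagger.pickle")  # hypothetical file name

    def load_or_train_tagger(train_tagger):
        """Return the cached PoS tagger if present; otherwise train and cache one."""
        if os.path.exists(TAGGER_PATH):
            with open(TAGGER_PATH, "rb") as f:
                return pickle.load(f)
        tagger = train_tagger()  # expensive: train a new tagger from scratch
        with open(TAGGER_PATH, "wb") as f:
            pickle.dump(tagger, f)
        return tagger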

Note that the temp folder also contains an intermediate result from analyzing the WordNet corpus for comparison (filter_index.pickle). It should not be necessary to re-run this benchmark creation, although the code to do so is included in indexfilter.py for completeness.
