# coding: utf-8
"""
Created by: Shaheen Syed
Date: August 2018
The pre-processing phase can be seen as the process of going from a document source to an interpretable representation for the topic model algorithm. This phase is typically different
for full-text and abstract data. One of the main differences is that abstract data is often provided in a clean format, whereas full-text is commonly obtained by converting a PDF document
into its plain text representation.
Within this phase, an important step is to filter out content that is unimportant from a topic model's point of view, rather than from a human's point of view. Abstract data
usually comes in a clean format of around 300--400 words, with little additional text added to it; typically the copyright statement is the only text that needs to be removed.
In contrast, full-text articles can contain a lot of additional text added by the publisher, such as article meta-data and boilerplate. It is important that such
additional text is removed, and various methods to do so exist, ranging from deleting the first cover page or the first n characters of the content, to using regular expressions
or other pattern-matching techniques to find and remove it, to more advanced methods. For full-text articles, a choice can also be made to exclude the reference
list or acknowledgements section of the publication.
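For illustration, a minimal regex-based sketch of boilerplate removal (the pattern is hypothetical
and would have to be tailored to each journal and publisher):

	import re
	# strip a hypothetical publisher stamp such as 'Downloaded from https://... on 1 August 2018'
	content = re.sub(r'Downloaded from https?://[^ ]+ on [0-9]+ [A-Za-z]+ [0-9]{4}', '', content)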
Latent Dirichlet allocation (LDA), like other probabilistic topic models, is a bag-of-words (BOW) model. The words within the documents therefore need to be tokenized: the process
of obtaining individual words (also known as unigrams) from sentences. For English text, splitting on white space is the simplest example. Besides obtaining unigrams, it
is also important to find multi-word expressions, such as two-word (bigram) or multi-word (n-gram) combinations. Named entity recognition (NER)---a technique from natural
language processing (NLP)---can, for instance, be used to find multi-word expressions related to names, nationalities, companies, locations, and objects within the documents.
The inclusion of bigrams and entities allows for a richer bag-of-words representation than a standard unigram representation. Documents from languages with implicit word boundaries
may require a more advanced type of tokenization.
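A minimal tokenization sketch with spacy (a sketch only; the word_tokenizer, named_entity_recognition,
and get_bigrams helpers used in the code below wrap comparable logic):

	import spacy
	nlp = spacy.load('en')
	doc = nlp(u'Topic models reveal latent themes in fisheries research.')
	unigrams = [t.lemma_ for t in doc if t.is_alpha and not t.is_stop]
	entities = [e.text for e in doc.ents]
	bigrams = zip(unigrams, unigrams[1:])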
Although all tokens within a document serve an important grammatical or syntactic function, for topic modeling they are not all equally important. Tokens that are numbers,
punctuation marks, or single-character words bear no topical meaning and need to be filtered out. Furthermore, stop words (e.g., the, is, a, which) carry no specific meaning
from a topical point of view and need to be removed as well. For English, and a number of other languages, fixed lists of stop words exist and can easily be used
(many NLP packages, such as NLTK and spacy, include them). However, it is important to also create a domain-specific (also referred to as corpus-specific) list of stop words and filter
for those words. Such domain-specific stop words can also become apparent in the evaluation phase; if this is the case, going back to the pre-processing phase and excluding them
is a good approach. Another approach to removing stop words is to use TF-IDF scores and include or exclude words above or below a certain threshold. Contrary to our analysis, some
have indicated that removing stop words has no substantial effect on model likelihood, topic coherence, or classification accuracy.
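A sketch of combining a fixed stop word list with domain-specific additions (the added words are
hypothetical examples for a fisheries corpus, not taken from the actual analysis):

	from nltk.corpus import stopwords
	stop_words = set(stopwords.words('english'))
	# hypothetical corpus-specific stop words identified during evaluation
	stop_words.update(['fish', 'fisheries', 'study'])
	tokens = [t for t in tokens if t not in stop_words]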
For grammatical reasons, different word forms or derivationally related words can have a similar meaning and, ideally, for a topic model analysis, such terms need to be
grouped (i.e., they need to be normalized). Stemming and lemmatization are two NLP techniques to reduce inflectional and derivational forms of words to a common base
form. Stemming heuristically cuts off derivational affixes to achieve some normalization, albeit crude in most cases. Stemming loses the ability to relate stemmed words
back to their original part-of-speech, such as verbs or nouns, and decreases the interpretability of topics in later stages. Lemmatization is a more sophisticated normalization method
that uses a vocabulary and morphological analysis to reduce words to their base form, called the lemma. For increased topic interpretability, we recommend lemmatization over stemming.
Additionally, uppercase and lowercase words can be grouped for further normalization. The process of normalization is particularly critical for languages with a richer
morphology. Failing to do so can cause the vocabulary to become overly large, which can slow down posterior inference and lead to topics of poor quality.
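The difference is easy to see with NLTK (both classes are part of nltk.stem):

	from nltk.stem import PorterStemmer, WordNetLemmatizer
	PorterStemmer().stem('studies')           # 'studi'  -- crude, hard to interpret
	WordNetLemmatizer().lemmatize('studies')  # 'study'  -- a readable lemma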
For reference articles see:
Syed, S., Borit, M., & Spruit, M. (2018). Narrow lenses for capturing the complexity of fisheries: A topic analysis of fisheries science from 1990 to 2016. Fish and Fisheries, 19(4), 643–661. http://doi.org/10.1111/faf.12280
Syed, S., & Spruit, M. (2017). Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (pp. 165–174). Tokyo, Japan: IEEE. http://doi.org/10.1109/DSAA.2017.61
Syed, S., & Spruit, M. (2018a). Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation. International Journal of Semantic Computing, 12(3), 399–423. http://doi.org/10.1142/S1793351X18400184
Syed, S., & Spruit, M. (2018b). Selecting Priors for Latent Dirichlet Allocation. In 2018 IEEE 12th International Conference on Semantic Computing (ICSC) (pp. 194–202). Laguna Hills, CA, USA: IEEE. http://doi.org/10.1109/ICSC.2018.00035
Syed, S., & Weber, C. T. (2018). Using Machine Learning to Uncover Latent Research Topics in Fishery Models. Reviews in Fisheries Science & Aquaculture, 26(3), 319–336. http://doi.org/10.1080/23308249.2017.1416331
"""
# packages and modules
import logging
import os
import sys
import itertools
from collections import Counter

import spacy
from nltk.corpus import stopwords

from database import MongoDatabase
from helper_functions import *
class Preprocessing():

	def __init__(self):

		logging.info('Initialized {}'.format(self.__class__.__name__))

		# instantiate database
		self.db = MongoDatabase()

		# set utf-8 as the default string encoding (Python 2 idiom; not needed, and not available, on Python 3)
		reload(sys)
		sys.setdefaultencoding('utf8')
	def full_text_preprocessing(self, pdf_folder = os.path.join('files', 'pdf')):
		"""
		Preprocess full-text publications:
		- convert PDF to plain text
		- correct for carriage returns
		- correct for end-of-line hyphenation
		- remove boilerplate
		- remove bibliography
		- remove acknowledgements

		Parameters
		----------
		pdf_folder : string (optional)
			location where the PDF documents are stored
		"""

		logging.info('Start {}'.format(sys._getframe().f_code.co_name))
		# read pdf files that need to be converted
		F = [x for x in read_directory(pdf_folder) if x[-4:] == '.pdf']

		# read documents from the DB that have already been processed so we can skip them
		processed_documents = ['{}-{}-{}'.format(x['journal'], x['year'], x['title']) for x in self.db.read_collection(collection = 'publications_raw')]

		# loop over each file, convert the pdf to plain text, and save it with its meta data to the DB
		for i, f in enumerate(F):

			# extract meta data from the folder structure and file name
			# (assumes POSIX-style paths of the form files/pdf/<journal>/<year>/<file>.pdf)
			journal = f.split('/')[2]
			year = f.split('/')[3]
			title = f.split('/')[4].replace('-', ' ')[4:-4].strip()

			# console output
			print_doc_verbose(i, len(F), journal, year, title)

			# check if the PDF has already been processed
			if '{}-{}-{}'.format(journal, year, title) in processed_documents:
				logging.info('PDF document already processed, skipping ...')
				continue

			# convert content of the PDF to plain text
			content = pdf_to_plain(f)
			# check if content could be extracted
			if content is not None:

				# fix soft hyphens
				content = content.replace(u'\xad', "-")
				# fix em dashes
				content = content.replace(u'\u2014', "-")
				# fix en dashes
				content = content.replace(u'\u2013', "-")
				# fix minus signs
				content = content.replace(u'\u2212', "-")
				# fix hyphenation that occurs just before a new line
				content = content.replace('-\n', '')
				# remove new lines/carriage returns
				content = content.replace('\n', ' ')
				# correct for ligatures
				content = content.replace(u'\ufb02', "fl")   # fl ligature
				content = content.replace(u'\ufb01', "fi")   # fi ligature
				content = content.replace(u'\ufb00', "ff")   # ff ligature
				content = content.replace(u'\ufb03', "ffi")  # ffi ligature
				content = content.replace(u'\ufb04', "ffl")  # ffl ligature
"""
Remove boilerplate content:
Especially journal publications have lots of boilerplate content on the titlepage. Removing of this is specific for each
journal and you can use some regular expressions to identify and remove it.
"""
"""
Remove acknowledgemends and/or references
This is a somewhat crude example
"""
if content.rfind("References") > 0:
content = content[:content.rfind("References")]
"""
Remove acknowledgements
"""
if content.rfind("Acknowledgment") > 0:
content = content[:content.rfind("Acknowledgment")]
				# prepare dictionary to save into MongoDB
				doc = {'journal': journal, 'title': title, 'year': year, 'content': content}

				# save to database
				self.db.insert_one_to_collection(doc = doc, collection = 'publications_raw')
	def general_preprocessing(self, min_bigram_count = 5):
		"""
		General preprocessing of publications (used for both abstracts and full-text)

		Parameters
		----------
		min_bigram_count : int (optional)
			minimum number of times a bigram must occur to be included in the list of bigrams; bigrams with a lower frequency are discarded
		"""

		logging.info('Start {}'.format(sys._getframe().f_code.co_name))

		# read document collection
		D = self.db.read_collection(collection = 'publications_raw')

		# setup spacy natural language processing object
		nlp = setup_spacy()
		# loop through the documents and tokenize the content
		for i, d in enumerate(D):

			# check if tokens are already present; if so, skip
			if d.get('tokens') is None:

				# print to console
				print_doc_verbose(i, D.count(), d['journal'], d['year'], d['title'])

				# get content from document and convert to a spacy object
				content = nlp(d['content'])

				# tokenize and lemmatize; remove punctuation and single-character words
				unigrams = word_tokenizer(content)

				# get entities
				entities = named_entity_recognition(content)

				# get bigrams; each bigram is repeated as often as it occurs so its bag-of-words
				# frequency is preserved, and bigrams below min_bigram_count are discarded
				bigrams = get_bigrams(" ".join(unigrams))
				bigrams = [['{} {}'.format(x[0], x[1])] * y for x, y in Counter(bigrams).most_common() if y >= min_bigram_count]
				bigrams = list(itertools.chain(*bigrams))

				# combine unigrams, bigrams, and entities into a single token list
				d['tokens'] = unigrams + bigrams + entities

				# save dictionary to database
				self.db.update_collection(collection = 'publications_raw', doc = d)
			else:
				logging.debug('Document already tokenized, skipping ...')
"""
internal helper function
"""
def print_doc_verbose(i, total, journal, year, title):
# console output
logging.debug('processing file: {}/{}'.format(i+1,total))
logging.debug('journal : {}'.format(journal))
logging.debug('year : {}'.format(year))
logging.debug('title : {}'.format(title))
def setup_spacy():

	# set up the spacy language model ('en' is the model shortcut used by spacy 1.x/2.x)
	nlp = spacy.load('en')

	# extend spacy's stop word list with NLTK's, since spacy's defaults do not cover all stop words;
	# also add title-cased variants so capitalized stop words are caught
	for word in set(stopwords.words('english')):
		nlp.Defaults.stop_words.add(unicode(word))
		nlp.Defaults.stop_words.add(unicode(word.title()))

	# flag every stop word on the vocabulary so token.is_stop works as expected
	for word in nlp.Defaults.stop_words:
		lex = nlp.vocab[word]
		lex.is_stop = True

	return nlp
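

# example usage (a minimal sketch; assumes a running MongoDB instance and the
# database and helper_functions modules from this repository)
if __name__ == '__main__':

	preprocessing = Preprocessing()

	# convert PDFs to plain text and store them in the 'publications_raw' collection
	preprocessing.full_text_preprocessing()

	# tokenize the stored documents into unigrams, bigrams, and named entities
	preprocessing.general_preprocessing()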