# <markdowncell>
# ### Keywords and ngrams

# <markdowncell>
# `corpkit` has some functions for keywording, ngramming and collocation. Each can take several kinds of input data:
#
# 1. a path to a subcorpus (of either parse trees or raw text)
# 2. `conc()` output
# 3. a string of text
#
# `keywords()` produces both keywords and ngrams. It relies on code from the [Spindle](http://openspires.oucs.ox.ac.uk/spindle/) project.

# <codecell>
from corpkit import keywords
keys, ngrams = keywords(lines)
# show the ten highest-ranked keywords
for key in keys[:10]:
    print(key)
# show every ngram
for ngram in ngrams:
    print(ngram)

# <markdowncell>
# You can also use `interrogator()` to search for keywords or ngrams. To do this, pass `'keywords'` or `'ngrams'` in place of a Tregex query. You should also specify a dictionary to use as the reference corpus. If you specify `dictionary = 'self'`, a dictionary will be built from the entire corpus, saved, and used.

# <codecell>
kwds_bnc = interrogator(annual_trees, 'words', 'keywords', dictionary = 'bnc.p')

# <codecell>
kwds = interrogator(annual_trees, 'words', 'keywords', dictionary = 'self')
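# <markdowncell>
# To make the role of the reference dictionary concrete, here is a minimal sketch of keyness scoring on two toy token lists. It uses the two-term log-likelihood score that is common in corpus linguistics. This is an illustration only: it is not taken from `corpkit` or Spindle, which may rank keywords differently, and the names `log_likelihood`, `toy_keywords`, `target_tokens` and `reference_tokens` are hypothetical.

# <codecell>
```python
from __future__ import division, print_function

import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood for a word seen `a` times in a target corpus
    of `c` tokens and `b` times in a reference corpus of `d` tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the target corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    score = 0.0
    if a:
        score += a * math.log(a / e1)
    if b:
        score += b * math.log(b / e2)
    return 2 * score


def toy_keywords(target_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the target corpus."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c = sum(target.values())
    d = sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        # skip words that are no more frequent than in the reference
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    # highest score first; ties broken alphabetically
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top]


# toy data, invented for this illustration
target_tokens = "the risk of risk is the risk we take".split()
reference_tokens = "the cat sat on the mat and the dog saw the cat".split()

toy = toy_keywords(target_tokens, reference_tokens)
for word, score in toy:
    print(word, round(score, 2))
```

# <markdowncell>
# In this toy data, *the* is frequent in the target text but even more frequent in the reference, so it is filtered out. That is the reason a reference corpus is needed: raw frequency alone would rank function words first.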
# <markdowncell>
# Keywording is the process of generating a list of words that are unusually frequent in the corpus of interest. To do it, you need a *reference corpus*, or at least a *reference wordlist*, to which your *target corpus* can be compared. Often, *reference corpora* take the form of very large collections of language drawn from a variety of spoken and written sources.
#
# Keywording is what generates word-clouds beside online news stories, blog posts, and the like. In combination with speech-to-text, it's used in Oxford University's [Spindle Project](http://openspires.oucs.ox.ac.uk/spindle/) to automatically archive recorded lectures with useful tags.
#
# We'll use corpkit, which relies on Spindle.

# <codecell>
! pip install corpkit
import corpkit
from corpkit import keywords

# <codecell>
# this tool works with raw text, not tokens!
keys, ngrams = keywords(raw.encode("UTF-8"))
for key in keys[:20]:
    print(key)

# <markdowncell>
# Success! We have keywords.
#
# > Keep in mind, the BNC reference corpus was created before ISIS and ISIL existed. *Moslem/moslems* is a dispreferred spelling of *Muslim*, used more frequently in anti-Islamic discourse. Also, it's unlikely that a transcriber of the spoken BNC would choose the *Moslem* spelling. *Having an inappropriate reference corpus is a common methodological problem in discourse analytic work*.

# <headingcell level=2>
# Collocation

# <markdowncell>
# > *You shall know a word by the company it keeps.* - J.R. Firth, 1957
#
# Collocation is a very common area of interest in corpus linguistics. Words pattern together in both expected and unexpected ways. In some contexts, *drug* and *medication* are synonymous, but it would be very rare to hear about *illicit* or *street medication*. Similarly, doctors are unlikely to prescribe the *correct* or *appropriate drug*.
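# <markdowncell>
# The *drug* / *medication* contrast can be sketched with a simple window-based collocate count. This is a minimal illustration in plain Python, not the `corpkit` collocation function; the helper `collocates` and the sentence in `toy_text` are invented for the example.

# <codecell>
```python
from __future__ import print_function

from collections import Counter


def collocates(tokens, node, window=1):
    """Count words appearing within `window` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # do not count the node word itself
                counts[tokens[j]] += 1
    return counts


# invented toy text, pre-tokenised by whitespace
toy_text = ("police seized the illicit drug . the doctor prescribed the "
            "appropriate medication . an illicit drug was found . "
            "take the correct medication daily")
tokens = toy_text.split()

drug_collocates = collocates(tokens, "drug")
medication_collocates = collocates(tokens, "medication")

print(drug_collocates.most_common(3))
print(medication_collocates.most_common(3))
```

# <markdowncell>
# On real data you would normally use a wider window and weight the raw counts with an association measure, since frequent words such as *the* co-occur with almost everything.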