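Python keywords examples

Programming language: Python

Namespace/package name: corpkit

Method/function: keywords

Examples at hotexamples.com: 2

Python keywords - 2 examples found. These are the top-rated real-world Python examples of corpkit.keywords extracted from open source projects.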

Example #1

File: orientation.py Project: agogear/corpkit

# <markdowncell>
# ### Keywords and ngrams

# <markdowncell>
# `corpkit` has some functions for keywording, ngramming and collocation. Each can take a number of kinds of input data:

# 1. a path to a subcorpus (of either parse trees or raw text)
# 2. `conc()` output
# 3. a string of text

# `keywords()` produces both keywords and ngrams. It relies on code from the [Spindle](http://openspires.oucs.ox.ac.uk/spindle/) project.

# <codecell>
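from corpkit import keywords
# `lines` is assumed to be defined earlier in the notebook (for example, `conc()` output)
keys, ngrams = keywords(lines)
# show the first ten keywords
for key in keys[:10]:
    print(key)
# show every ngram
for ngram in ngrams:
    print(ngram)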
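
# <markdowncell>
# As a rough illustration of what an ngram list contains, the sketch below counts contiguous word sequences in a string of raw text. It is not how `keywords()` or Spindle builds its ngrams: `count_ngrams`, the regex tokenizer and the sample sentence are all made up for this example, and it assumes Python 3 with only the standard library.

```python
import re
from collections import Counter


def count_ngrams(text, n=2):
    """Count contiguous n-word sequences in a string of raw text."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Zip n staggered copies of the token list to get every n-word window.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


sample = "the cat sat on the mat and the cat slept"
bigrams = count_ngrams(sample, n=2)
for gram, freq in bigrams.most_common(3):
    print(gram, freq)
```

# Raising `n` to 3 gives trigrams from the same function; a real tool would also need a proper tokenizer and some filtering of very common sequences.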

# <markdowncell>
# You can also use `interrogator()` to search for keywords or ngrams. To do this, pass `'keywords'` or `'ngrams'` instead of a Tregex query. You should also specify a dictionary to use as the reference corpus. If you specify `dictionary = 'self'`, a dictionary is built from the entire corpus, saved, and used as the reference.

# <codecell>
# `interrogator` and `annual_trees` are assumed to be defined earlier in the notebook
kwds_bnc = interrogator(annual_trees, 'words', 'keywords', dictionary='bnc.p')

# <codecell>
kwds = interrogator(annual_trees, 'words', 'keywords', dictionary='self')

# <markdowncell>

Example #2

# <markdowncell>
# Keywording is the process of generating a list of words that are unusually frequent in the corpus of interest. To do it, you need a *reference corpus*, or at least a *reference wordlist* to which your *target corpus* can be compared. Often, *reference corpora* take the form of very large collections of language drawn from a variety of spoken and written sources.

# Keywording is what generates word-clouds beside online news stories, blog posts, and the like. In combination with speech-to-text, it's used in Oxford University's [Spindle Project](http://openspires.oucs.ox.ac.uk/spindle/) to automatically archive recorded lectures with useful tags.

# We'll use corpkit, which relies on Spindle.
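
# <markdowncell>
# To make the target-versus-reference comparison concrete, here is a minimal sketch that scores each target word with the log-likelihood statistic, one common keyness measure in corpus linguistics. Whether corpkit or Spindle computes exactly this is not confirmed here; `tokenize`, `log_likelihood`, `keyness` and the two toy texts are hypothetical names and data for illustration, written for Python 3 with only the standard library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, c, d):
    """Score a word seen a times in a target of c tokens and b times in a reference of d tokens."""
    # Expected counts if the word were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll


def keyness(target_text, reference_text):
    """Rank target words that are relatively more frequent than in the reference."""
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    c = sum(target.values())
    d = sum(reference.values())
    scores = {}
    for word, a in target.items():
        b = reference[word]
        # a/c > b/d, cross-multiplied to avoid dividing by zero.
        if a * d > b * c:
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


toy_target = "risk risk risk of the market"
toy_reference = "the cat and the dog of the house and the garden"
for word, score in keyness(toy_target, toy_reference):
    print(word, round(score, 2))
```

# With texts this small the scores mean little; the point is only that a word ranks highly when its share of the target corpus is much larger than its share of the reference.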

# <codecell>
# `!` is an IPython shell escape; outside a notebook, run `pip install corpkit` in a terminal
! pip install corpkit
import corpkit
from corpkit import keywords

# <codecell>
# this tool works with raw text, not tokens!
# `raw` is assumed to be a unicode string loaded earlier in the notebook
keys, ngrams = keywords(raw.encode("UTF-8"))
for key in keys[:20]:
    print(key)

# <markdowncell>
# Success! We have keywords.

# > Keep in mind, the BNC reference corpus was created before ISIS and ISIL existed. *Moslem/moslems* is a dispreferred spelling of Muslim, used more frequently in anti-Islamic discourse. Also, it's unlikely that a transcriber of the spoken BNC would choose the Moslem spelling. *Having an inappropriate reference corpus is a common methodological problem in discourse analytic work*.

# <headingcell level=2>
# Collocation

# <markdowncell>
# > *You shall know a word by the company it keeps.* - J.R. Firth, 1957

# Collocation is a very common area of interest in corpus linguistics. Words pattern together in both expected and unexpected ways. In some contexts, *drug* and *medication* are synonymous, but it would be very rare to hear about *illicit* or *street medication*. Similarly, doctors are unlikely to prescribe the *correct* or *appropriate drug*.
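
# <markdowncell>
# The *drug*/*medication* contrast can be explored with a very small sketch that counts the words appearing within a fixed window around a node word. This is not corpkit's collocation function: `collocates`, the window size and the sample sentence are assumptions made up for illustration, written for Python 3 with only the standard library.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


sample = ("police seized illicit drugs and street drugs while "
          "the doctor prescribed the appropriate medication")
print(collocates(sample, "drugs").most_common(5))
print(collocates(sample, "medication").most_common(5))
```

# Raw window counts favour very frequent words such as *the*, so studies of collocation usually rank candidates with an association measure (for example mutual information or log-likelihood) rather than by frequency alone.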