Example #1
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import MWETokenizer


def iteratively_contract_bigrams(self):
    """
    Iteratively contract bigrams (up to max_collocation_iterations times)
    that score higher on the collocation_score_function than
    min_collocation_score.
    """
    for _ in range(self.max_collocation_iterations):
        # collect bigram statistics over all sentences in the corpus
        bigramer = BigramCollocationFinder.from_documents(self.tokens_by_sent())
        mwes = list(
            bigramer.above_score(
                self.collocation_score_function, self.min_collocation_score
            )
        )
        # stop as soon as no bigram clears the threshold
        if not mwes:
            break
        # merge each qualifying bigram into a single underscore-joined token
        contracter = MWETokenizer(mwes)
        self.tokens_by_sent_by_doc_ = [
            contracter.tokenize_sents(doc) for doc in self.tokens_by_sent_by_doc()
        ]
Example #2
    "niga[fattened]v/i", "ŋeš[tree]n", "tag[touch]v/t"
]
# contract known lexical MWEs in this single line
tokenizer.tokenize(lemm_line)

# The tokenizer thus found three Multiple Word Expressions in this single line and connected the lemmas of each MWE with underscores. The line also illustrates a limitation of this approach. The [epsd2/literary](http://oracc.org/epsd2/literary) edition of [Iddin-Dagan A](http://oracc.org/epsd2/literary/Q000447) represents the first word of line 148 as {udu}a-lum, taking "udu" (sheep) as a determinative (or semantic classifier). The edition of the list of animals in [OB Ura 3](http://oracc.org/dcclt/Q000001) in [DCCLT](http://oracc.org/dcclt), however, treats this same sign sequence as a sequence of two words: udu a-lum, lemmatized as udu\[sheep\]N aslum\[sheep\]N (line 8). Although aslum\[sheep\]N on its own will still result in a match, the combination udu\[sheep\]N aslum\[sheep\]N will appear to be absent from the literary corpus. Matches are found only if the words are represented in exactly the same way, and small inconsistencies in lemmatization may therefore result in false negatives.
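#
# As a toy illustration of this exact-match requirement (this cell is not part of the original notebook; the lemma strings are taken from the example above), an `MWETokenizer` that knows the two-word entry contracts it only when both lemmas occur as separate, adjacent tokens:

# In[ ]:

from nltk.tokenize import MWETokenizer

demo_tokenizer = MWETokenizer([("udu[sheep]N", "aslum[sheep]N")])

# lemmatized as two words, as in OB Ura 3: the MWE is contracted
print(demo_tokenizer.tokenize(["udu[sheep]N", "aslum[sheep]N"]))
# ['udu[sheep]N_aslum[sheep]N']

# with udu taken as a determinative only aslum[sheep]N remains,
# so the two-word MWE cannot match
print(demo_tokenizer.tokenize(["aslum[sheep]N"]))
# ['aslum[sheep]N']
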
#
# We can now apply the MWE tokenizer on the entire data set, by re-tokenizing each list of lemmas in the `lemma` column of the `lit_lines` DataFrame. The function `tokenize_sents()` (for "tokenize sentences") can be used to tokenize a list of lists where each second-order list represents a sentence (or, in our case, a line) in one go. The result of this function is again a list of lists; it contains the same tokens, but now Multiple Word Expressions are connected by underscores.
#
# We extract the `lemma` column from the `lit_lines` DataFrame and split each entry into a list - producing a list of lists that can be fed as input to the MWETokenizer. The output is again a list of lists - each line is represented by a list of lemmas. These lists are joined, so that each line is once again represented by a sequence of lemmas in a single string. This data is added as a new column (`lemma_mwe`) to the DataFrame `lit_lines`.
#
# The `lemma_mwe` column of the `lit_lines` DataFrame will now represent the [epsd2/literary](http://oracc.org/epsd2/literary) data in a line-by-line presentation of lemmatizations, with underscores connecting lemmas if a corresponding sequence of lemmas exists as an Old Babylonian lexical entry. This version of the DataFrame `lit_lines` is pickled for use in the next notebook.

# In[ ]:

# split each line's lemmatization into a list of lemmas
lemma_list = [lemma.split() for lemma in lit_lines["lemma"]]
# contract lexical MWEs line by line, joining them with underscores
lemma_mwe = tokenizer.tokenize_sents(lemma_list)
# rejoin each line into a single string and store it in a new column
lit_lines["lemma_mwe"] = [' '.join(line) for line in lemma_mwe]
# save for use in the next notebook
lit_lines.to_pickle('output/litlines.p')
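
#
# For reference (not in the original notebook): the next notebook can reload the pickled DataFrame with `pandas.read_pickle()`, assuming the usual `pd` alias.

# In[ ]:

# reload the pickled DataFrame in a later notebook
import pandas as pd
lit_lines = pd.read_pickle('output/litlines.p')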

# Now join all the tuples in the list `lex` with underscores, so that the multiple-word entries in the lexical corpus are represented in the same way as they are in the literary corpus. Thus the entry **udu diŋir-e gu₇-a** (sheep eaten by a god) has gone through the following transformations:
# * lemmatization: udu\[sheep\]n diŋir\[god\]n gu\[eat\]v/t
# * tuple (lex):   (udu\[sheep\]n, diŋir\[god\]n, gu\[eat\]v/t)
# * MWE (lex_vocab): udu\[sheep\]n_diŋir\[god\]n_gu\[eat\]v/t

# In[ ]:

# join multiple-word lexical entries with underscores, matching the
# representation of MWEs in the literary corpus
lex_vocab = ["_".join(entry) for entry in lex]
lex_vocab.sort()

# We can now extract the column `lemma_mwe` from the `lit_lines` DataFrame in order to get a full list of all lemmas and Multiple Word Expressions in the entire [epsd2/literary](http://oracc.org/epsd2/literary) data set. In order to do so we first join all entries in `lemma_mwe` (concatenating all literary lines into one big sequence of entries) and then split the result on white space. That will create a list of all vocabulary items - with MWEs joined by underscores.
#
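# As a sketch of the step just described (assuming only the `lit_lines` DataFrame built above; the variable name `lit_vocab` is illustrative):

# In[ ]:

# join all lines into one string, then split on white space
lit_vocab = " ".join(lit_lines["lemma_mwe"]).split()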