def test_corpus_apply(texts):
    """Check that ``Corpus.apply`` transforms documents in place while keeping
    document labels and recorded lengths consistent.

    Builds a corpus from *texts* (labels are the stringified indices), applies
    ``str.upper`` to every document, and verifies:
      - ``apply`` returns a ``Corpus`` instance,
      - labels and lengths match the values captured before the call,
      - each document now equals the uppercased original text.
    """
    corpus = Corpus({str(i): txt for i, txt in enumerate(texts)})
    snapshot = corpus.copy()
    labels_before = corpus.doc_labels
    lengths_before = corpus.doc_lengths

    # apply() operates in place but should still hand back a Corpus
    assert isinstance(corpus.apply(str.upper), Corpus)
    assert corpus.doc_labels == labels_before
    assert corpus.doc_lengths == lengths_before

    # every transformed document must be the uppercased original
    for label, text in corpus.items():
        assert snapshot[label].upper() == text
# Apply the character translation table to every document in the corpus,
# then report which non-ASCII characters still remain.
print('replacing characters in each document of the corpus')
corpus.replace_characters(char_transl_table)

print('these non-ASCII characters are left:')
pprint(corpus.unique_characters - set(string.printable))

#%% Correct contractions

# Some contractions carry a stray space, e.g. "EU -Hilfen" where it should be
# "EU-Hilfen". Fix this by running a regular-expression substitution over each
# document: group 1 is the leading word, group 2 the stray whitespace,
# group 3 the hyphenated tail.
pttrn_contraction_ws = re.compile(r'(\w+)(\s+)(-\w+)')

print('correcting wrong contractions')

# For each match `m` in document text `t`, rebuild the contraction without
# the whitespace group (i.e. drop group 2).
corpus.apply(lambda t: pttrn_contraction_ws.sub(lambda m: m.group(1) + m.group(3), t))

#%% Create a TMPreproc object for token processing

# This step tokenizes all documents immediately, so it takes a while.
print('creating TMPreproc object from corpus')
preproc = TMPreproc(corpus, language='german')
print('created: %s' % preproc)

# The raw corpus is no longer needed; free the memory.
del corpus

#%% Calculate the total number of tokens in the whole corpus

print('total number of tokens in the whole corpus:')
print(sum(preproc.doc_lengths.values()))