def test_corpus_apply(texts):
    """Uppercasing via ``Corpus.apply`` must transform each document text
    in place while leaving document labels and lengths untouched, and
    must return a ``Corpus`` instance."""
    corpus = Corpus({str(idx): text for idx, text in enumerate(texts)})
    snapshot = corpus.copy()
    labels_before = corpus.doc_labels
    lengths_before = corpus.doc_lengths

    # apply() returns a Corpus (it operates on the documents in place)
    result = corpus.apply(str.upper)
    assert isinstance(result, Corpus)

    # metadata is unaffected: same labels, same per-document lengths
    assert corpus.doc_labels == labels_before
    assert corpus.doc_lengths == lengths_before

    # every document text now equals the uppercased original
    for label, text in corpus.items():
        assert text == snapshot[label].upper()
# Apply the character translation table to every document in the corpus.
print('replacing characters in each document of the corpus')
corpus.replace_characters(char_transl_table)

# Report which non-ASCII characters survived the replacement pass.
print('these non-ASCII characters are left:')
leftover_chars = corpus.unique_characters - set(string.printable)
pprint(leftover_chars)

#%% Correct contractions

# Some contractions contain a stray space, e.g. "EU -Hilfen" where it should
# be "EU-Hilfen". Fix this by running a regular expression (RE) substitution
# over each document in the corpus:
#   group 1 = leading word, group 2 = stray whitespace, group 3 = "-word"
pttrn_contraction_ws = re.compile(r'(\w+)(\s+)(-\w+)')

print('correcting wrong contractions')
# drop group 2 (the stray whitespace) by re-joining groups 1 and 3
corpus.apply(lambda text: pttrn_contraction_ws.sub(r'\1\3', text))

#%% Create a TMPreproc object for token processing

# Constructing the TMPreproc instance tokenizes all documents up front,
# so this step takes a while.
print('creating TMPreproc object from corpus')
preproc = TMPreproc(corpus, language='german')
print('created: %s' % preproc)

# The raw corpus is no longer needed once tokenized; drop it to free memory.
del corpus

#%% Calculate the total number of tokens in the whole corpus

print('total number of tokens in the whole corpus:')
# doc_lengths maps document label -> token count; summing gives the total
n_tokens_total = sum(preproc.doc_lengths.values())
print(n_tokens_total)