preproc.pos_tag()
add_timing('pos_tag')

preproc.lemmatize()
add_timing('lemmatize')

preproc.remove_special_chars_in_tokens()
add_timing('remove_special_chars_in_tokens')

preproc.tokens_to_lowercase()
add_timing('tokens_to_lowercase')

preproc.clean_tokens()
add_timing('clean_tokens')

preproc.remove_common_tokens(0.9)
preproc.remove_uncommon_tokens(0.05)
add_timing('remove_common_tokens / remove_uncommon_tokens')

vocab = preproc.vocabulary
add_timing('get vocab')

tokens = preproc.tokens
add_timing('get tokens')

tokens_tagged = preproc.get_tokens(with_metadata=True, as_datatables=False)
add_timing('get tagged tokens')

dtm = preproc.get_dtm()
add_timing('get dtm')
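#%% Quick sanity check of the pipeline output
# This cell is a sketch added for illustration, not part of the original
# benchmark: `tokens` maps document labels to token sequences and `dtm` is a
# (sparse) document-term matrix, so the DTM shape should equal
# (number of documents, vocabulary size).
print('vocabulary size:', len(vocab))
print('number of documents:', len(tokens))
print('DTM shape: %d documents x %d vocabulary terms' % dtm.shape)
assert dtm.shape == (len(tokens), len(vocab))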
vocab_doc_freq_df = pd.DataFrame({'token': list(vocab_doc_freq.keys()),
                                  'freq': list(vocab_doc_freq.values())})

print('top 50 tokens by relative document frequency:')
vocab_top = vocab_doc_freq_df.sort_values('freq', ascending=False).head(50)
print(vocab_top)

# plot this
plt.figure()
vocab_top.plot(x='token', y='freq', kind='bar')
plt.show()

#%% Further token cleanup

# we can remove tokens above a certain threshold of (relative or absolute) document frequency
preproc.remove_common_tokens(0.8)   # this will only remove "müssen"

# since we'll later use tf-idf, common words don't have much influence on the result and can
# remain (a short tf-idf sketch follows after the next cell)

#%% Document lengths (number of tokens per document)

doc_labels = np.array(list(preproc.doc_lengths.keys()))
doc_lengths = np.array(list(preproc.doc_lengths.values()))

print('range of document lengths: %d tokens minimum, %d tokens maximum'
      % (np.min(doc_lengths), np.max(doc_lengths)))
print('mean document length:', np.mean(doc_lengths))
print('median document length:', np.median(doc_lengths))

plt.figure()
plt.hist(doc_lengths, bins=100)
plt.title('Histogram of document lengths')
plt.show()
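#%% (Sketch) tf-idf weighting of the DTM
# The cleanup cell above notes that tf-idf will be applied later. This is a
# minimal sketch of that step using scikit-learn's TfidfTransformer; it is an
# assumption for illustration, since the original script may use a different
# tf-idf implementation. The DTM is fetched again (into the hypothetical name
# `dtm_cleaned`) because remove_common_tokens(0.8) above changed the vocabulary.
from sklearn.feature_extraction.text import TfidfTransformer

dtm_cleaned = preproc.get_dtm()
tfidf_dtm = TfidfTransformer().fit_transform(dtm_cleaned)
print('tf-idf weighted DTM shape: %d documents x %d vocabulary terms' % tfidf_dtm.shape)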