def parse(lang_sample):
    """Tally word popularity using novel extracts, etc.

    Returns a ``(vocabulary, counts)`` pair: the set of distinct
    words found in *lang_sample* and a default-zero mapping of each
    word to its occurrence count.
    """
    # Keep duplicates so the tally reflects true frequencies.
    tokens = words_from_archive(lang_sample, include_dups=True)
    counts = zero_default_dict()
    for token in tokens:
        counts[token] = counts[token] + 1
    return set(tokens), counts
def parse(lang_sample, file_format='bz'):
    """Tally word popularity using novel extracts, etc.

    Parameters
    ----------
    lang_sample : path/name of the sample to read.
    file_format : 'bz' to read via ``words_from_archive`` (keeping
        duplicates), 'txt' to read via ``words_from_txt``.

    Returns
    -------
    (set, dict-like) : the distinct words and a default-zero mapping
        of each word to its occurrence count.

    Raises
    ------
    ValueError : if *file_format* is not 'bz' or 'txt'.
    """
    # NOTE: the docstring must be the first statement in the body —
    # in the previous revision it sat after this import and was
    # silently discarded.
    from autocorrect.utils import words_from_archive, words_from_txt, \
        zero_default_dict
    if file_format == 'bz':
        words = words_from_archive(lang_sample, include_dups=True)
    elif file_format == 'txt':
        words = words_from_txt(lang_sample)
    else:
        # Previously an unknown format fell through both branches and
        # crashed below with NameError; fail fast with a clear message.
        raise ValueError("file_format must be 'bz' or 'txt', "
                         "got {!r}".format(file_format))
    counts = zero_default_dict()
    for word in words:
        counts[word] += 1
    return set(words), counts
from autocorrect.utils import words_from_archive # en_US_GB_CA is a superset of US, GB and CA # spellings (color, colour, etc). It contains # roughly half a million words. For this # example, imagine it's just seven words... # # we (lower) # flew (lower) # to (lower) # Abu (mixed) # Dhabi (mixed) # via (lower) # Colombo (mixed) LOWERCASE = words_from_archive('en_US_GB_CA_lower.txt') # {'we', 'flew', 'to', 'via'} CASE_MAPPED = words_from_archive('en_US_GB_CA_mixed.txt', map_case=True) # {abu': 'Abu', # 'dhabi': 'Dhabi', # 'colombo': 'Colombo'} # # Note that en_US_GB_CA_mixed.txt also contains # acronyms/mixed case variants of common words, # so in reality, CASE_MAPPED also contains: # # {'to': 'TO', # 'via': 'Via'}