def transform(self, docs):
    # text_to_words is assumed here to return a list of tokens
    # (e.g. a project-local wrapper around blingfire)
    docvecs = np.zeros((len(docs), self.gram_length))
    print("making vectors")
    for index, doc in enumerate(tqdm_notebook(docs)):
        for word, count in Counter(text_to_words(doc)).items():
            # weight each word's vector by its term count, damped by
            # its document frequency (see fit below)
            v = (self[word] * count) / (1 + self.idf[word])
            docvecs[index] += v
    return docvecs

def fix_author_text(s):
    """Author text gets special treatment: no de-dashing, and periods
    are replaced by white space.
    """
    if pd.isnull(s):
        return ''
    s = unidecode(s)
    # fix cases when quotes are repeated
    s = re.sub('"+', '"', s)
    # no periods as those make author first-letter matching hard
    s = re.sub(r'\.', ' ', s)
    # project-local whitespace helpers (defined elsewhere)
    s = replace_special_whitespace_chars(s)
    s = standardize_whitespace_length(s)
    return text_to_words(s).lower().strip()

def encode_lines(self, lines):
    """
    Encode a set of lines. All lines will be encoded together.
    """
    enc_lines = []
    for line in lines:
        line = line.strip()
        if len(line) == 0 and not self.args.keep_empty:
            return ["EMPTY", None]
        if self.args.tokenizer == 'bpe':
            tokens = self.encode(line)
            enc_lines.append(" ".join(tokens))
        else:
            enc_lines.append(text_to_words(line))
    return ["PASS", enc_lines]

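# A minimal sketch of how an encoder like the one above is typically
# driven (assumptions: the method lives on a picklable encoder object,
# and each work item is a list of parallel lines; the names encoder,
# in_path and out_path are hypothetical, not from the original code).
from multiprocessing import Pool


def encode_file(encoder, in_path, out_path, chunksize=1000):
    with open(in_path, encoding="utf-8") as fin, \
            open(out_path, "w", encoding="utf-8") as fout:
        with Pool() as pool:
            results = pool.imap(encoder.encode_lines,
                                ([line] for line in fin), chunksize)
            for status, enc_lines in results:
                # "EMPTY" batches are skipped unless keep_empty is set
                if status == "PASS":
                    for enc_line in enc_lines:
                        print(enc_line, file=fout)
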
def fix_text(s):
    """General-purpose text fixing using the nlpre package and then
    tokenizing with blingfire.
    """
    if pd.isnull(s):
        return ''
    s = unidecode(s)
    # fix cases when quotes are repeated
    s = re.sub('"+', '"', s)
    # dashes make quote matching difficult
    s = re.sub('-', ' ', s)
    s = replace_special_whitespace_chars(s)
    # tokenize
    s = text_to_words(s).lower().strip()
    # note: removing single non-alphanumerics means that we will match
    # ngrams that are usually separated by e.g. commas in the text;
    # this improves the number of matches but also surfaces false
    # positives
    return remove_single_non_alphanumerics(s)

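# Illustrative behaviour (assumes the project-local helpers act as
# their names suggest; the exact output depends on their
# implementations):
#
#   fix_text('A "quoted" phrase -- with dashes.')
#   -> 'a quoted phrase with dashes'
#
# the dashes become whitespace, blingfire separates the quotes and the
# final period into single-character tokens, and those single
# non-alphanumeric tokens are then removed.
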
        UnicodeSegmentTokenizer(word_bounds=True).tokenize,
    ),
    ("VTextTokenizer('en')", VTextTokenizer("en").tokenize),
    ("CharacterTokenizer(4)", CharacterTokenizer(4).tokenize),
]

if sacremoses is not None:
    db.append(("MosesTokenizer()", sacremoses.MosesTokenizer().tokenize))

if spacy is not None:
    from spacy.lang.en import English

    db.append(("Spacy en", English().tokenizer))

if blingfire is not None:
    db.append(
        ("BlingFire en", lambda x: blingfire.text_to_words(x).split(" ")))

for label, func in db:
    t0 = time()
    out = []
    for idx, doc in enumerate(data):
        out.append(func(doc))
    dt = time() - t0
    n_tokens = sum(len(tok) for tok in out)
    print("{:>45}: {:.2f}s [{:.1f} MB/s, {:.0f} kWPS]".format(
        label, dt, dataset_size / dt, n_tokens * 1e-3 / dt))

def fit(self, docs):
    # despite the name, self.idf stores raw document frequencies;
    # transform (above) uses them as an IDF-style damping term
    self.idf = defaultdict(int)
    for doc in docs:
        for word in set(text_to_words(doc)):
            self.idf[word] += 1

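# Hypothetical usage of the fit/transform pair above (the class name
# and corpus are assumptions; self[word] implies the class also maps a
# word to a vector of length self.gram_length):
#
#   vectorizer = NgramVectorizer()
#   vectorizer.fit(corpus)            # builds document frequencies
#   X = vectorizer.transform(corpus)  # -> ndarray, shape (len(corpus), gram_length)
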
from blingfire import text_to_words


def blingf_tokenizer(s: str):
    return text_to_words(s)

import blingfire


def bling_tokenizer(lang):
    # blingfire's default model is language-agnostic, so lang is unused
    return lambda x: blingfire.text_to_words(x).split(" ")

import sys

from blingfire import text_to_words

for l in sys.stdin:
    if l.strip():
        print(text_to_words(l.strip()))
    else:
        print('')

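# Run as a stdin-to-stdout filter, e.g. (script name assumed):
#
#   python tokenize_stdin.py < corpus.txt > corpus.tok.txt
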
from blingfire import text_to_words


def word_tokenize(sent):
    return text_to_words(sent).split(' ')

from blingfire import text_to_words


def word_tokenize(string):
    """Tokenize a string with blingfire, splitting its space-delimited
    output."""
    return text_to_words(string).split(' ')
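
# Expected behaviour (illustrative): blingfire inserts spaces around
# punctuation, so punctuation comes back as separate tokens:
#
#   word_tokenize("Hello, world!")
#   -> ['Hello', ',', 'world', '!']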