def load(cls, path, fields, tokenizer_lang, tokenizer_dir, use_gpu, verbose=True, max_sent_length=math.inf):
    """Tokenize the raw text file at *path* and build a corpus from it.

    Args:
        path: path to a raw (untokenized) text file.
        fields: iterable of Field objects; ``None`` entries are replaced
            by placeholder ``Field``s named by their position.
        tokenizer_lang: language code passed to the Tokenizer.
        tokenizer_dir: model directory passed to the Tokenizer.
        use_gpu: whether the tokenizer should run on GPU.
        verbose: forwarded to the Tokenizer.
        max_sent_length: sentences with more lines than this are discarded.

    Returns:
        A corpus instance (``cls``) holding the parsed sentences.
    """
    # BUG FIX: was verbose=True, silently ignoring the caller's *verbose* flag.
    tokenizer = Tokenizer(lang=tokenizer_lang, dir=tokenizer_dir, use_gpu=use_gpu, verbose=verbose)
    sentences = []
    fields = [field if field is not None else Field(str(i)) for i, field in enumerate(fields)]
    with open(path, 'r') as f:
        lines = []
        for line in tokenizer.format(tokenizer.predict(f.read())):
            line = line.strip()
            if not line:
                # Blank line terminates a sentence.
                if len(lines) > max_sent_length:
                    # BUG FIX: Logger.info takes %-style lazy arguments and has
                    # no 'file' keyword (that is print()'s signature); the old
                    # call raised TypeError whenever a sentence was discarded.
                    logger.info('Discarded sentence longer than max_sent_length: %d', len(lines))
                    lines = []
                    continue
                sentences.append(Sentence(fields, lines))
                lines = []
            else:
                if not line.startswith('#'):
                    # Pad with '_' columns up to the CoNLL field count.
                    # BUG FIX: the old format()/join() appended a trailing tab
                    # when the line already had all columns (empty join).
                    line += '\t_' * (len(CoNLL._fields) - len(line.split('\t')))
                    assert len(CoNLL._fields) == len(line.split('\t')), '{} - {} vs {}'.format(line, len(CoNLL._fields), len(line.split()))
                    lines.append(line)
        # ROBUSTNESS: flush a trailing sentence when the tokenizer output does
        # not end with a blank line (mirrors the max_sent_length check above).
        if lines and len(lines) <= max_sent_length:
            sentences.append(Sentence(fields, lines))
    return cls(fields, sentences)
def test_corpus_load(self):
    """Tokenizing a raw-text corpus yields two-column lines (id, form)."""
    tokenizer = Tokenizer(**self.args)
    raw_text_file = '/project/piqasso/Collection/IWPT20/train-dev/UD_Italian-ISDT/it_isdt-ud-dev.txt'
    with open(raw_text_file) as fin:
        text = fin.read()
    for row in tokenizer.format(tokenizer.predict(text)):
        # Skip blank separators and '#' comment lines.
        if not row or row.startswith('#'):
            continue
        assert len(row.split('\t')) == 2, row
def test_corpus_load(self):
    """Tokenizing raw text yields full 10-column CoNLL-U lines."""
    tokenizer = Tokenizer(**self.args)
    # IDIOM FIX: the old code wrapped the string in io.StringIO() only to
    # immediately .read() it back — pass the string to predict() directly.
    text = "Un corazziere contro Scalfaro. L'attore le disse baciami o torno a riprendermelo."
    for line in tokenizer.format(tokenizer.predict(text)):
        # Skip blank separators and '#' comment lines.
        if line and not line.startswith('#'):
            assert len(line.split('\t')) == 10, line
def load(cls, path, fields, tokenizer_lang, tokenizer_dir, verbose=True, max_sent_length=math.inf):
    """Tokenize the raw text file at *path* and build a corpus from it.

    Args:
        path: path to a raw (untokenized) text file.
        fields: iterable of Field objects; ``None`` entries are replaced
            by placeholder ``Field``s named by their position.
        tokenizer_lang: language code passed to the Tokenizer.
        tokenizer_dir: model directory passed to the Tokenizer.
        verbose: forwarded to the Tokenizer.
        max_sent_length: sentences with more lines than this are discarded.

    Returns:
        A corpus instance (``cls``) holding the parsed sentences.
    """
    tokenizer = Tokenizer(lang=tokenizer_lang, dir=tokenizer_dir, verbose=verbose)
    sentences = []
    fields = [
        field if field is not None else Field(str(i)) for i, field in enumerate(fields)
    ]
    with open(path, 'r') as f:
        lines = []
        for line in tokenizer.format(tokenizer.predict(f.read())):
            line = line.strip()
            if not line:
                # Blank line terminates a sentence.
                if len(lines) > max_sent_length:
                    # BUG FIX: Logger.info takes %-style lazy arguments and has
                    # no 'file' keyword (that is print()'s signature); the old
                    # call raised TypeError whenever a sentence was discarded.
                    logger.info('Discarded sentence longer than max_sent_length: %d', len(lines))
                    lines = []
                    continue
                sentences.append(Sentence(fields, lines))
                lines = []
            else:
                if not line.startswith('#'):
                    # append empty columns up to the CoNLL field count
                    line += '\t_' * (len(CoNLL._fields) - len(line.split('\t')))
                    lines.append(line)
        # ROBUSTNESS: flush a trailing sentence when the tokenizer output does
        # not end with a blank line (mirrors the max_sent_length check above).
        if lines and len(lines) <= max_sent_length:
            sentences.append(Sentence(fields, lines))
    return cls(fields, sentences)