Example #1
    def _get_word2id(self, tokenizer: XLNetTokenizer, convert: bool = True):
        """
        Get the model vocabulary as a mapping from words to indices.

        Args:
            tokenizer: model tokenizer
            convert: whether to strip the special SentencePiece underline character
                from word-start tokens, turning them into ordinary words, and to
                prepend intra-word pieces with a special marker symbol.

        Returns:
            model vocabulary
        """
        word2id = dict()
        for idx in range(tokenizer.vocab_size):
            token: str = tokenizer.convert_ids_to_tokens(idx)
            if convert:
                # Prepare vocab suitable for substitution evaluation
                # Remove sentence piece underline and add special symbol to intra word parts
                if token.startswith(SPIECE_UNDERLINE) and len(token) > 1:
                    token = token[1:]
                else:
                    token = self.NON_START_SYMBOL + token
            # Register the token (converted or raw) so the vocabulary is built
            # even when convert is False
            word2id[token] = idx
        return word2id
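
For illustration, here is a minimal, self-contained sketch of the conversion rule the docstring describes. The value NON_START_SYMBOL = "##" is an assumption for this example; the class above defines its own constant. SPIECE_UNDERLINE is the SentencePiece word-boundary character "▁".

# Hypothetical sketch; NON_START_SYMBOL = "##" is an assumed value.
SPIECE_UNDERLINE = "▁"
NON_START_SYMBOL = "##"

def convert_token(token: str) -> str:
    # Word-initial pieces carry the SentencePiece underline; strip it.
    if token.startswith(SPIECE_UNDERLINE) and len(token) > 1:
        return token[1:]
    # Everything else is an intra-word piece; mark it explicitly.
    return NON_START_SYMBOL + token

assert convert_token("▁This") == "This"    # word start becomes a plain word
assert convert_token("est") == "##est"     # intra-word piece gets the marker
assert convert_token("▁") == "##▁"         # a bare underline is not stripped
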
    def test_full_tokenizer(self):
        tokenizer = XLNetTokenizer(SAMPLE_VOCAB, keep_accents=True)

        tokens = tokenizer.tokenize("This is a test")
        self.assertListEqual(tokens, ["▁This", "▁is", "▁a", "▁t", "est"])

        self.assertListEqual(tokenizer.convert_tokens_to_ids(tokens),
                             [285, 46, 10, 170, 382])

        tokens = tokenizer.tokenize("I was born in 92000, and this is falsé.")
        self.assertListEqual(
            tokens,
            [
                SPIECE_UNDERLINE + "I",
                SPIECE_UNDERLINE + "was",
                SPIECE_UNDERLINE + "b",
                "or",
                "n",
                SPIECE_UNDERLINE + "in",
                SPIECE_UNDERLINE + "",
                "9",
                "2",
                "0",
                "0",
                "0",
                ",",
                SPIECE_UNDERLINE + "and",
                SPIECE_UNDERLINE + "this",
                SPIECE_UNDERLINE + "is",
                SPIECE_UNDERLINE + "f",
                "al",
                "s",
                "é",
                ".",
            ],
        )
        ids = tokenizer.convert_tokens_to_ids(tokens)
        self.assertListEqual(ids, [
            8, 21, 84, 55, 24, 19, 7, 0, 602, 347, 347, 347, 3, 12, 66, 46, 72,
            80, 6, 0, 4
        ])

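        # Ids the sample vocabulary cannot represent (0, the <unk> id,
        # produced above for "9" and "é") decode back to "<unk>" rather
        # than to the original pieces.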
        back_tokens = tokenizer.convert_ids_to_tokens(ids)
        self.assertListEqual(
            back_tokens,
            [
                SPIECE_UNDERLINE + "I",
                SPIECE_UNDERLINE + "was",
                SPIECE_UNDERLINE + "b",
                "or",
                "n",
                SPIECE_UNDERLINE + "in",
                SPIECE_UNDERLINE + "",
                "<unk>",
                "2",
                "0",
                "0",
                "0",
                ",",
                SPIECE_UNDERLINE + "and",
                SPIECE_UNDERLINE + "this",
                SPIECE_UNDERLINE + "is",
                SPIECE_UNDERLINE + "f",
                "al",
                "s",
                "<unk>",
                ".",
            ],
        )
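
For context, a minimal round-trip sketch with a pretrained checkpoint instead of the test fixture's SAMPLE_VOCAB; "xlnet-base-cased" is a standard Hugging Face model id, and its token splits will differ from the fixture above.

# Round-trip sketch; assumes network access to download "xlnet-base-cased".
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

tokens = tokenizer.tokenize("This is a test")
ids = tokenizer.convert_tokens_to_ids(tokens)
back_tokens = tokenizer.convert_ids_to_tokens(ids)

# In-vocabulary pieces survive the round trip unchanged; ids outside the
# vocabulary would come back as "<unk>", as the assertions above show.
assert back_tokens == tokens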