    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        """Tokenize the text using the SentencePiece model.

        SentencePiece adds a special char in front of (some) words and splits
        words into sub-words. To ensure the entity start and end values match
        the token values, tokenize the text first using the whitespace
        tokenizer. If individual tokens are split up into multiple tokens, add
        this information to the respective tokens.
        """

        # perform whitespace tokenization
        tokens_in = super().tokenize(message, attribute)

        tokens_out = []

        for token in tokens_in:
            token_start, token_end, token_text = token.start, token.end, token.text
            # use SentencePiece model to tokenize the text
            split_token_strings = self._tokenize(token_text)

            # clean tokens (remove special chars and empty tokens)
            split_token_strings = self._clean_tokens(split_token_strings)

            tokens_out += train_utils.align_tokens(split_token_strings,
                                                   token_end, token_start)

        return tokens_out
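For context, here is a minimal, self-contained sketch of what the alignment step has to do: distribute character offsets across the sub-tokens of one whitespace token, pinning the first sub-token's start and the last sub-token's end to the original span so entity annotations keep lining up. The `Token` stand-in and the function body below are illustrative assumptions, not Rasa's actual `train_utils.align_tokens` implementation.

from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    # stand-in for illustration; Rasa's Token carries more fields
    text: str
    start: int
    end: int

def align_tokens(tokens_in: List[str], token_end: int, token_start: int) -> List[Token]:
    """Assign character offsets to the sub-tokens of one whitespace token."""
    tokens_out = []
    offset = token_start
    for index, text in enumerate(tokens_in):
        is_last = index == len(tokens_in) - 1
        # pin the last sub-token to the original end offset, since cleaned
        # sub-token strings may be shorter than the surface text they cover
        end = token_end if is_last else offset + len(text)
        tokens_out.append(Token(text, offset, end))
        offset += len(text)
    return tokens_out

For example, align_tokens(["sand", "wich"], token_end=13, token_start=5) yields offsets (5, 9) and (9, 13), exactly covering the original "sandwich" span.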
Example #2
    def _tokenize_example(
        self, message: Message, attribute: Text
    ) -> Tuple[List[Token], List[int]]:

        tokens_in = self.whitespace_tokenizer.tokenize(message, attribute)

        tokens_out = []

        token_ids_out = []

        for token in tokens_in:
            # use lm specific tokenizer to further tokenize the text
            split_token_ids, split_token_strings = self._lm_tokenize(token.text)

            split_token_ids, split_token_strings = self._lm_specific_token_cleanup(
                split_token_ids, split_token_strings
            )

            token_ids_out += split_token_ids

            tokens_out += train_utils.align_tokens(
                split_token_strings, token.end, token.start
            )

        return tokens_out, token_ids_out
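The cleanup step matters because language-model tokenizers emit marker characters that are not present in the original text, which would break the offset alignment above. Here is a hedged sketch of what such a cleanup could look like for a BERT-style WordPiece tokenizer; the function name and body are assumptions, not Rasa's actual `_lm_specific_token_cleanup`.

from typing import List, Tuple

def bert_style_token_cleanup(
    token_ids: List[int], token_strings: List[str]
) -> Tuple[List[int], List[str]]:
    """Strip WordPiece continuation markers and special tokens so the
    sub-token strings can be aligned against the original surface text."""
    cleaned = [
        # e.g. "sandwiches" -> ["sand", "##wich", "##es"] becomes ["sand", "wich", "es"]
        (token_id, text[2:] if text.startswith("##") else text)
        for token_id, text in zip(token_ids, token_strings)
        if text not in ("[CLS]", "[SEP]", "[PAD]")
    ]
    if not cleaned:
        return [], []
    ids, strings = zip(*cleaned)
    return list(ids), list(strings)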
Example #3
    def _tokenize_example(
        self, message: Message, attribute: Text, model_size: int = 384
    ) -> Tuple[List[Token], List[int]]:
        """Tokenize a single message example.

        Many language models add a special char in front of (some) words and
        split words into sub-words. To ensure the entity start and end values
        match the token values, tokenize the text first using the whitespace
        tokenizer. If individual tokens are split up into multiple tokens, we
        make sure that the start and end values of the first and last
        respective tokens stay the same.

        Args:
            message: Single message object to be processed.
            attribute: Property of message to be processed, one of ``TEXT`` or ``RESPONSE``.
            model_size: Maximum number of tokens the model can handle (e.g. BERT = 512).

        Returns:
            List of tokens and list of token ids for the corresponding attribute of the message.
        """

        tokens_in = self.whitespace_tokenizer.tokenize(message, attribute)

        tokens_out = []

        token_ids_out = []

        for token in tokens_in:
            # use lm specific tokenizer to further tokenize the text
            split_token_ids, split_token_strings = self._lm_tokenize(token.text)

            split_token_ids, split_token_strings = self._lm_specific_token_cleanup(
                split_token_ids, split_token_strings
            )

            if len(tokens_out) + len(split_token_strings) >= model_size:
                logger.warning(
                    f"Number of tokens in the sentence exceeds the model size. "
                    f"Skipping the sentence's remaining tokens... "
                    f"Sentence text: '{message.text[:50]} ...'"
                )
                break

            token_ids_out += split_token_ids

            tokens_out += train_utils.align_tokens(split_token_strings,
                                                   token.end, token.start)

        return tokens_out, token_ids_out
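The overflow guard above drops the current whitespace token and everything after it once the running sub-token count would reach model_size; because the check uses >= and runs before appending, the output stays strictly below the limit, and the default of 384 presumably leaves headroom under BERT's 512-token cap. A standalone sketch of just that guard (names hypothetical):

from typing import List

def truncate_to_model_size(
    sub_token_batches: List[List[str]], model_size: int = 384
) -> List[str]:
    """Keep whole whitespace tokens until the next batch of sub-tokens
    would reach the limit, then drop the rest of the sentence."""
    kept: List[str] = []
    for batch in sub_token_batches:
        if len(kept) + len(batch) >= model_size:
            break  # never split a whitespace token across the cutoff
        kept.extend(batch)
    return kept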
Example #4
    def _tokenize_example(
        self, message: Message, attribute: Text
    ) -> Tuple[List[Token], List[int]]:
        """Tokenize a single message example.

        Many language models add a special char in front of (some) words and
        split words into sub-words. To ensure the entity start and end values
        match the token values, tokenize the text first using the whitespace
        tokenizer. If individual tokens are split up into multiple tokens, we
        make sure that the start and end values of the first and last
        respective tokens stay the same.

        Args:
            message: Single message object to be processed.
            attribute: Property of message to be processed, one of ``TEXT`` or ``RESPONSE``.

        Returns:
            List of tokens and list of token ids for the corresponding attribute of the message.
        """

        tokens_in = self.whitespace_tokenizer.tokenize(message, attribute)

        tokens_out = []

        token_ids_out = []

        for token in tokens_in:
            # use lm specific tokenizer to further tokenize the text
            split_token_ids, split_token_strings = self._lm_tokenize(token.text)

            split_token_ids, split_token_strings = self._lm_specific_token_cleanup(
                split_token_ids, split_token_strings
            )

            token_ids_out += split_token_ids

            tokens_out += train_utils.align_tokens(
                split_token_strings, token.end, token.start
            )

        return tokens_out, token_ids_out
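To see the end-to-end effect, a small usage sketch built on the align_tokens stand-in shown after the first example above (offsets are character positions in the original message text):

# message text: "I like sandwiches"
# whitespace tokens: ("I", 0, 1), ("like", 2, 6), ("sandwiches", 7, 17)
# suppose the LM tokenizer splits "sandwiches" into ["sand", "wich", "es"]
for token in align_tokens(["sand", "wich", "es"], token_end=17, token_start=7):
    print(token.text, token.start, token.end)
# sand 7 11
# wich 11 15
# es 15 17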