Example #1
import logging
from pathlib import Path
from typing import List

import sentencepiece as spm
from sentencepiece import SentencePieceProcessor

# `const`, `cached_path`, `Token`, and `Tokenizer` are assumed to be provided by the
# surrounding project: `const` holds the TRAINING_DATA filename constant, `cached_path`
# resolves/caches model files, and `Token`/`Tokenizer` are the tokenizer base classes.


def _log_sample_data(model_dir: str, sp: spm.SentencePieceProcessor):
    """Log how the trained model segments the first line of the training data."""
    training_data_path = Path(model_dir) / const.TRAINING_DATA
    if not training_data_path.is_file():
        logging.info("Training data not found for SP sampling")
        return

    with open(training_data_path) as fin:
        sample = fin.readline().strip()

    logging.info(f"Tokenizer model vocabulary size: {len(sp)} tokens")
    logging.info(
        "Mapping first line of training data\n\n{}\n ---- sample tokens mapped to pieces ---- > \n{}\n"
        .format(repr(sample),
                ", ".join(sp.SampleEncodeAsPieces(sample, -1, 0.1))))
    logging.info(
        "Mapping first line of training data\n\n{}\n ---- sample tokens mapped to int ---- > \n{}\n"
        .format(repr(sample),
                ", ".join([str(idx) for idx in sp.EncodeAsIds(sample)])))


class SubwordTokenizer(Tokenizer):
    """Tokenizer that splits text into SentencePiece subword pieces."""

    def __init__(self,
                 model_path: str = None,
                 nbest_size: int = None,
                 alpha: float = None):
        # Resolve the SentencePiece model file via the project's cached_path helper,
        # then load it into a processor.
        self._model_path = cached_path(model_path)
        self._processor = SentencePieceProcessor()
        self._processor.Load(self._model_path)
        # When both are set, tokenize() samples segmentations (subword regularization)
        # instead of always returning the single best split.
        self._nbest_size = nbest_size
        self._alpha = alpha

    def tokenize(self, text: str) -> List[Token]:
        # With sampling parameters set, draw a segmentation from the n-best lattice;
        # otherwise use the deterministic best segmentation.
        if self._nbest_size and self._alpha:
            subwords = self._processor.SampleEncodeAsPieces(text, self._nbest_size, self._alpha)
        else:
            subwords = self._processor.EncodeAsPieces(text)
        return [Token(s) for s in subwords]

    def batch_tokenize(self, texts: List[str]) -> List[List[Token]]:
        return [self.tokenize(text) for text in texts]
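
A usage sketch for the class above, again assuming a trained model at the hypothetical path m.model: constructed without nbest_size and alpha, the tokenizer segments deterministically via EncodeAsPieces; passing both makes tokenize() sample a new segmentation on each call, which is the usual setup for subword regularization during training.

# Deterministic segmentation: no sampling parameters supplied.
tokenizer = SubwordTokenizer(model_path="m.model")
print(tokenizer.tokenize("the quick brown fox"))

# Sampled segmentation: nbest_size=-1 uses the full lattice, alpha=0.1 smooths it.
sampling_tokenizer = SubwordTokenizer(model_path="m.model", nbest_size=-1, alpha=0.1)
print(sampling_tokenizer.batch_tokenize(["the quick brown fox"])[0])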