import logging
from pathlib import Path

import sentencepiece as spm

# `const` is assumed to be a project-local module that defines TRAINING_DATA
# (the file name of the training corpus inside the model directory).
import const


def _log_sample_data(model_dir: str, sp: spm.SentencePieceProcessor):
    """Log the tokenizer vocabulary size and a sample tokenization of the
    first line of the training data, both as subword pieces and as ids."""
    training_data_path = Path(model_dir) / const.TRAINING_DATA
    if not training_data_path.is_file():
        logging.info("Training data not found for SP sampling")
        return

    with open(training_data_path) as fin:
        sample = fin.readline().strip()

    logging.info(f"Tokenizer model vocabulary size: {len(sp)} tokens")
    logging.info(
        "Mapping first line of training data\n\n{}\n ---- sample tokens mapped to pieces ---- > \n{}\n"
        .format(repr(sample), ", ".join(sp.SampleEncodeAsPieces(sample, -1, 0.1))))
    logging.info(
        "Mapping first line of training data\n\n{}\n ---- sample tokens mapped to int ---- > \n{}\n"
        .format(repr(sample), ", ".join([str(idx) for idx in sp.EncodeAsIds(sample)])))
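# Illustrative only: a minimal sketch of how `_log_sample_data` might be invoked
# once a SentencePiece model has been trained. The directory "model_dir" and the
# file name "sp.model" are hypothetical placeholders, not values from the original code.
logging.basicConfig(level=logging.INFO)
sp = spm.SentencePieceProcessor()
sp.Load("model_dir/sp.model")  # load a previously trained SentencePiece model
_log_sample_data("model_dir", sp)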
from typing import List

from sentencepiece import SentencePieceProcessor

# `Tokenizer`, `Token`, and `cached_path` are assumed to come from the surrounding
# project (an AllenNLP-style tokenizer interface and file-caching helper).


class SubwordTokenizer(Tokenizer):
    """Tokenizer that splits text into subword pieces using a SentencePiece model.

    If both `nbest_size` and `alpha` are given, subword regularization is used
    and segmentations are sampled; otherwise the single best segmentation is used.
    """

    def __init__(self,
                 model_path: str = None,
                 nbest_size: int = None,
                 alpha: float = None):
        self._model_path = cached_path(model_path)
        self._processor = SentencePieceProcessor()
        self._processor.Load(self._model_path)
        self._nbest_size = nbest_size
        self._alpha = alpha

    def tokenize(self, text: str) -> List[Token]:
        if self._nbest_size and self._alpha:
            # Sample a segmentation (subword regularization).
            subwords = self._processor.SampleEncodeAsPieces(
                text, self._nbest_size, self._alpha)
        else:
            # Deterministic best segmentation.
            subwords = self._processor.EncodeAsPieces(text)
        return [Token(s) for s in subwords]

    def batch_tokenize(self, texts: List[str]) -> List[List[Token]]:
        return [self.tokenize(text) for text in texts]
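# Illustrative only: a minimal usage sketch of SubwordTokenizer. The model path,
# nbest_size, and alpha values below are hypothetical placeholders. Passing both
# nbest_size and alpha turns on sampled segmentations, so repeated calls on the
# same text may return different subword splits; omitting them always returns
# the single best segmentation.
deterministic = SubwordTokenizer("models/sp.model")
sampled = SubwordTokenizer("models/sp.model", nbest_size=64, alpha=0.1)

print(deterministic.tokenize("unsupervised subword tokenization"))
print(sampled.tokenize("unsupervised subword tokenization"))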