from typing import Any, Dict, Sequence

import pytest
import torch

import lmp.util.dset
# Import paths below assume the lmp package layout used elsewhere in the suite.
from lmp.dset import WikiText2Dset
from lmp.tknzr import WsTknzr


def test_slow_tensor_dset(max_seq_len: int) -> None:
  """Load dataset and convert to tensor on the fly."""
  # Build a tiny vocabulary.  With ``min_count=10`` and each token appearing
  # only once, no text token passes the threshold, so encoding falls back to
  # special-token ids; the shape checks below do not depend on vocabulary
  # content.
  tknzr = WsTknzr(is_uncased=True, max_vocab=-1, min_count=10)
  tknzr.build_vocab(batch_txt=['a', 'b', 'c'])

  wiki_dset = WikiText2Dset(ver='valid')

  # ``SlowTensorDset`` encodes each sample to token ids on the fly instead of
  # precomputing tensors, hence the name.
  dset = lmp.util.dset.SlowTensorDset(dset=wiki_dset, max_seq_len=max_seq_len, tknzr=tknzr)

  assert isinstance(dset, lmp.util.dset.SlowTensorDset)
  assert len(dset) == len(wiki_dset)
  for idx, tkids in enumerate(dset):
    assert isinstance(tkids, torch.Tensor), 'Each sample in the tensor dataset must be a tensor.'
    assert tkids.size() == torch.Size([max_seq_len]), 'Each sample in the tensor dataset must have the same length.'
    assert torch.all(dset[idx] == tkids), '``__getitem__`` and ``__iter__`` must yield the same samples.'
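

# A minimal sketch of the parametrization these tests rely on.  The concrete
# values below are illustrative assumptions, not the suite's real fixtures:
# ``max_seq_len`` only needs to be a positive length, and the expected vocab
# presumes the four special tokens ``[bos]``/``[eos]``/``[pad]``/``[unk]``
# occupy ids 0-3 before frequent tokens receive ids in descending-count order.
@pytest.fixture(params=[16, 32, 128])
def max_seq_len(request) -> int:
  """Sequence lengths exercised by ``test_slow_tensor_dset``."""
  return request.param


@pytest.mark.parametrize(
  'parameters,test_input,expected',
  [
    # Single hedged case: lowercase everything, keep the whole vocabulary,
    # and admit every token seen at least once.
    (
      {'is_uncased': True, 'max_vocab': -1, 'min_count': 1, 'tk2id': None},
      ['a', 'b', 'a'],
      {'[bos]': 0, '[eos]': 1, '[pad]': 2, '[unk]': 3, 'a': 4, 'b': 5},
    ),
  ],
)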
def test_build_vocab(
  parameters: Dict[str, Any],
  test_input: Sequence[str],
  expected: Dict[str, int],
) -> None:
  r"""Correctly build vocabulary under the constraints of the given parameters."""
  tknzr = WsTknzr(
    is_uncased=parameters['is_uncased'],
    max_vocab=parameters['max_vocab'],
    min_count=parameters['min_count'],
    tk2id=parameters['tk2id'],
  )

  tknzr.build_vocab(batch_txt=test_input)

  assert tknzr.tk2id == expected, 'Vocabulary must match the expected token-to-id mapping.'
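

# A follow-up sanity sketch (an illustrative addition, not part of the original
# suite).  It assumes ``min_count`` is an inclusive lower bound on token
# occurrence counts, as the parameters above suggest: tokens seen fewer than
# ``min_count`` times stay out of ``tk2id``.
def test_min_count_filters_rare_tokens() -> None:
  """Tokens occurring fewer than ``min_count`` times must not enter the vocabulary."""
  tknzr = WsTknzr(is_uncased=True, max_vocab=-1, min_count=2)
  # 'a' appears twice, 'b' once; only 'a' should reach the vocabulary.
  tknzr.build_vocab(batch_txt=['a a b'])
  assert 'a' in tknzr.tk2id
  assert 'b' not in tknzr.tk2id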