Python PreTrainedTokenizerFast.unk_tokenの例

プログラミング言語: Python

名前空間/パッケージ名: transformers

メソッド/関数: unk_token

hotexamples.comのコード掲載数: 2

Python PreTrainedTokenizerFast.unk_token - 2件のコード例が見つかりました。すべてオープンソースプロジェクトから抽出されたPythonのtransformers.PreTrainedTokenizerFast.unk_tokenの実例で、最も評価が高いものを厳選しています。コード例の評価を行っていただくことで、より質の高いコード例が表示されるようになります。

よく使われるメソッド

表示非表示

PreTrainedTokenizerFast(28)

from_pretrained(25)

encode(6)

add_special_tokens(5)

pad_token(5)

encode_plus(4)

mask_token(4)

decode(3)

tokenize(3)

batch_decode(2)

cls_token(2)

convert_ids_to_tokens(2)

convert_tokens_to_ids(2)

save_pretrained(2)

sep_token(2)

unk_token(2)

get_vocab(1)

num_special_tokens_to_add(1)

コード例 #1

ファイルを表示

ファイル: preprocess.py プロジェクト: antklen/data_fusion_solution

def preprocess(texts, tokenizer_path, max_len=32):

    input_ids, input_masks = [], []

    tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)
    tokenizer.mask_token = '[MASK]'
    tokenizer.pad_token = "[PAD]"
    tokenizer.sep_token = "[SEP]"
    tokenizer.cls_token = "[CLS]"
    tokenizer.unk_token = "[UNK]"

    for text in tqdm(texts):
        encoded = tokenizer.encode_plus(text,
                                        max_length=max_len,
                                        pad_to_max_length=True,
                                        truncation=True)
        input_ids.append(encoded['input_ids'])
        input_masks.append(encoded['attention_mask'])

    return [np.array(input_ids), np.array(input_masks)]

コード例 #2

ファイルを表示

ファイル: train_lm.py プロジェクト: antklen/data_fusion_solution

parser = argparse.ArgumentParser(description='Training language model')
parser.add_argument('--config_path', type=str, default='src/configs/train_lm1.yaml',
                    help='path to config file')
args = parser.parse_args()

config = OmegaConf.load(args.config_path)
print(OmegaConf.to_yaml(config))

os.environ['WANDB_DISABLED'] = 'true'

tokenizer = PreTrainedTokenizerFast(tokenizer_file=config.tokenizer_path)
tokenizer.mask_token = '[MASK]'
tokenizer.pad_token = "[PAD]"
tokenizer.sep_token = "[SEP]"
tokenizer.cls_token = "[CLS]"
tokenizer.unk_token = "[UNK]"

distilbert_config = DistilBertConfig(vocab_size=config.vocab_size,
                                     n_heads=8, dim=512, hidden_dim=2048)
model = DistilBertForMaskedLM(distilbert_config)

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=DATA_PATH,
    block_size=64)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=config.mlm_probability)

training_args = TrainingArguments(