Example #1
# Assumed imports for this excerpt (bert4keras-style; the exact module path may differ by version)
import pandas as pd
from bert4keras.tokenizers import Tokenizer, load_vocab

maxlen = 100
config_path = '/root/kg/bert/albert_base_zh/bert_config.json'
checkpoint_path = '/root/kg/bert/albert_base_zh/bert_model.ckpt'
dict_path = '/root/kg/bert/albert_base_zh/vocab.txt'

neg = pd.read_excel('datasets/neg.xls', header=None)
pos = pd.read_excel('datasets/pos.xls', header=None)
data, tokens = [], {}

_token_dict = load_vocab(dict_path)  # load the full pretrained vocabulary
_tokenizer = Tokenizer(_token_dict)  # temporary tokenizer over the full vocabulary

# Negative reviews are labelled 0; token frequencies are counted along the way
for d in neg[0]:
    data.append((d, 0))
    for t in _tokenizer.tokenize(d):
        tokens[t] = tokens.get(t, 0) + 1

# Positive reviews are labelled 1
for d in pos[0]:
    data.append((d, 1))
    for t in _tokenizer.tokenize(d):
        tokens[t] = tokens.get(t, 0) + 1

tokens = {i: j for i, j in tokens.items() if j >= 4}  # keep only tokens seen at least 4 times
token_dict, keep_words = {}, []  # keep_words lists the row indices retained from the original BERT vocabulary

for t in ['[PAD]', '[UNK]', '[CLS]', '[SEP]']:
    token_dict[t] = len(token_dict)
    keep_words.append(_token_dict[t])

for t in tokens:  # remap every frequent token that also exists in the original vocabulary
    if t in _token_dict and t not in token_dict:
        token_dict[t] = len(token_dict)
        keep_words.append(_token_dict[t])
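
The excerpt only builds the reduced vocabulary; how it is consumed is not shown. The sketch below is a hedged illustration, assuming a recent bert4keras release in which build_transformer_model accepts a keep_tokens argument (older releases expose the same idea under a different loader name): the new Tokenizer indexes the shrunken vocabulary, and keep_words tells the loader which rows of the pretrained embedding matrix to keep.

# Hedged usage sketch, not part of the original example
from bert4keras.models import build_transformer_model  # assumed import path

tokenizer = Tokenizer(token_dict)  # tokenizer over the reduced vocabulary

# keep_tokens slices the pretrained token embeddings down to the kept rows,
# so the model only carries the characters that actually occur in the corpus
model = build_transformer_model(
    config_path,
    checkpoint_path,
    model='albert',
    keep_tokens=keep_words,
)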
Example #2
    for t in txt.split('  '):
        for s in re.findall(u'.*?。', t):  # cut each chunk into sentences ending with '。'
            if len(s) <= maxlen - 2:  # leave room for [CLS] and [SEP]
                sents.append(s)
    novels.append(sents)

_token_dict = load_vocab(dict_path)  # load the full pretrained vocabulary
_tokenizer = Tokenizer(_token_dict)  # temporary tokenizer over the full vocabulary

if os.path.exists(lm_config):
    tokens = json.load(open(lm_config))  # reuse the cached token list
else:
    tokens = {}
    for novel in novels:
        for s in novel:
            for t in _tokenizer.tokenize(s):
                tokens[t] = tokens.get(t, 0) + 1
    tokens = [(i, j) for i, j in tokens.items() if j >= min_count]  # drop rare tokens
    tokens = sorted(tokens, key=lambda t: -t[1])  # most frequent first
    tokens = [t[0] for t in tokens]
    json.dump(tokens,
              codecs.open(lm_config, 'w', encoding='utf-8'),
              indent=4,
              ensure_ascii=False)

token_dict, keep_words = {}, []  # keep_words lists the row indices retained from the original BERT vocabulary

for t in ['[PAD]', '[UNK]', '[CLS]', '[SEP]']:
    token_dict[t] = len(token_dict)
    keep_words.append(_token_dict[t])
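
As in Example #1, the reduced vocabulary built here is typically consumed by remapping the frequent tokens and constructing a tokenizer over them. The sketch below is an assumed continuation, not part of the original listing; the tokenizer variable name is an assumption.

# Hedged continuation sketch, mirroring Example #1
for t in tokens:  # tokens is the frequency-ranked list loaded or built above
    if t in _token_dict and t not in token_dict:
        token_dict[t] = len(token_dict)
        keep_words.append(_token_dict[t])

tokenizer = Tokenizer(token_dict)  # tokenizer restricted to the reduced character set

# sanity check: every new id corresponds to one row kept from the pretrained embeddings
assert len(token_dict) == len(keep_words)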