Python normalize_text Examples

Programming language: Python

Namespace/package name: Utils.GeneralUtils

Method/function: normalize_text

Examples at hotexamples.com: 2

Python normalize_text - 2 examples found. These are the top-rated real-world Python examples of Utils.GeneralUtils.normalize_text, extracted from open source projects.

Example #1

import numpy as np

from Utils.GeneralUtils import normalize_text


def build_embedding(embed_file, targ_vocab, wv_dim):
    vocab_size = len(targ_vocab)
    emb = np.random.uniform(low=-1, high=1, size=(vocab_size, wv_dim))  # random init: every dimension is drawn uniformly from [-1, 1)
    emb[0] = 0  # word 0, <PAD>, gets an all-zero embedding
    w2id = {w: i for i, w in enumerate(targ_vocab)}
    lineCnt = 0
    with open(file=embed_file, encoding='utf-8') as f:  # read the GloVe embedding file
        for line in f:
            lineCnt = lineCnt + 1
            if lineCnt % 100000 == 0:
                print('.', end='', flush=True)  # progress indicator
            elems = line.split()
            token = normalize_text(''.join(elems[0:-wv_dim]))  # the last wv_dim fields of each line are the embedding; the fields before them are the word string
            if token in w2id:  # if the word is in the vocabulary, replace its random embedding with the GloVe one
                emb[w2id[token]] = [float(v) for v in elems[-wv_dim:]]
    return emb
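
The slicing in the loop above treats the last wv_dim fields of each line as the vector and everything before them as the token. Below is a minimal sketch of that parsing step in isolation. The helper name split_embedding_line is hypothetical and does not appear in the original project; it only mirrors the two slices used in the example.

```python
def split_embedding_line(line, wv_dim):
    """Hypothetical helper: split one GloVe-style text line into (token, vector)."""
    elems = line.split()
    # Fields before the last wv_dim entries form the token. They are joined
    # without a separator, mirroring the ''.join(...) call in the example.
    token = ''.join(elems[0:-wv_dim])
    # The last wv_dim fields are parsed as the embedding values.
    vector = [float(v) for v in elems[-wv_dim:]]
    return token, vector


token, vector = split_embedding_line("hello 0.1 -0.2 0.3", wv_dim=3)
print(token, vector)
```

Because the token fields are joined with an empty string, a token that itself contains whitespace in the file (for example ". . .") collapses to "..." before it is looked up in the vocabulary.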

Example #2

File: CoQAUtils.py Project: zmwebdev/quac

import numpy as np

from Utils.GeneralUtils import normalize_text


def build_embedding(embed_file, targ_vocab, wv_dim):
    vocab_size = len(targ_vocab)
    emb = np.random.uniform(-1, 1, (vocab_size, wv_dim))
    emb[0] = 0  # <PAD> should be all 0 (using broadcast)

    w2id = {w: i for i, w in enumerate(targ_vocab)}
    lineCnt = 0
    with open(embed_file, encoding="utf8") as f:
        for line in f:
            lineCnt = lineCnt + 1
            if lineCnt % 100000 == 0:
                print('.', end='', flush=True)  # progress indicator
            elems = line.split()
            # everything before the last wv_dim fields is the token
            token = normalize_text(''.join(elems[0:-wv_dim]))
            if token in w2id:
                # overwrite the random row with the pretrained vector
                emb[w2id[token]] = [float(v) for v in elems[-wv_dim:]]
    return emb
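
Neither example shows the body of normalize_text or a call to build_embedding, so here is a small end-to-end sketch under stated assumptions. The normalize_text below is a stand-in that applies Unicode NFD normalization; the real Utils.GeneralUtils.normalize_text may differ. The function build_embedding_from_lines is a hypothetical variant that takes any iterable of lines, so an in-memory io.StringIO can replace the GloVe file.

```python
import io
import unicodedata

import numpy as np


def normalize_text(text):
    # Assumption: stand-in for Utils.GeneralUtils.normalize_text, whose body
    # is not shown on this page. Here it applies Unicode NFD normalization.
    return unicodedata.normalize('NFD', text)


def build_embedding_from_lines(lines, targ_vocab, wv_dim):
    # Hypothetical variant of build_embedding that reads from any iterable
    # of lines instead of opening a file path.
    emb = np.random.uniform(-1, 1, (len(targ_vocab), wv_dim))
    emb[0] = 0  # row 0 is <PAD> and stays all zeros
    w2id = {w: i for i, w in enumerate(targ_vocab)}
    for line in lines:
        elems = line.split()
        token = normalize_text(''.join(elems[0:-wv_dim]))
        if token in w2id:
            # Words found in the file get their pretrained vector.
            emb[w2id[token]] = [float(v) for v in elems[-wv_dim:]]
    return emb


# Two made-up 3-dimensional vectors in GloVe text format.
fake_file = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
vocab = ['<PAD>', 'cat', 'bird']
emb = build_embedding_from_lines(fake_file, vocab, wv_dim=3)
print(emb.shape)
```

In this sketch 'cat' receives the vector from the file, 'bird' keeps its random initialization because it has no entry, and 'dog' is ignored because it is not in the vocabulary.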