Example #1
    def __init__(self, docs, embeddings, cuda, word_dropout=0, max_len=-1):
        """Build a batch from `docs`: numberize them, optionally truncate and
        apply word dropout, pad to a common length, and gather the embeddings
        of the words that actually occur in this batch.

        `cuda` is a callable that moves a tensor to the GPU (or is a no-op on
        CPU); `max_len == -1` means no truncation.
        """
        mini_vocab = Vocab.from_docs(docs,
                                     default=UNK_IDX,
                                     start=START_TOKEN_IDX,
                                     end=END_TOKEN_IDX)
        # Limit maximum document length (for efficiency reasons).
        if max_len != -1:
            docs = [doc[:max_len] for doc in docs]
        doc_lens = [len(doc) for doc in docs]
        self.doc_lens = cuda(torch.LongTensor(doc_lens))
        self.max_doc_len = max(doc_lens)
        if word_dropout:
            # For each token, with probability `word_dropout`, replace its
            # word index with UNK_IDX.
            docs = [[
                UNK_IDX if np.random.rand() < word_dropout else x for x in doc
            ] for doc in docs]
        # Pad docs so they all have the same length. We pad with UNK, whose
        # embedding is 0, so padding doesn't affect sums or averages.
        docs = [
            right_pad(mini_vocab.numberize(doc), self.max_doc_len, UNK_IDX)
            for doc in docs
        ]
        self.docs = [cuda(fixed_var(torch.LongTensor(doc))) for doc in docs]
        # Embedding matrix restricted to this batch's vocabulary, transposed
        # so each column is one word's embedding.
        local_embeddings = [embeddings[i] for i in mini_vocab.names]
        self.embeddings_matrix = cuda(
            fixed_var(FloatTensor(local_embeddings).t()))
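
The padding comment above relies on the UNK embedding being a zero vector, so padded positions contribute nothing to sums (and to averages, provided the true `doc_lens` are used as divisors). Below is a minimal, self-contained sketch of that property; the names are illustrative only and not taken from this project.

import torch

UNK_IDX = 0
# Toy embedding table: 4 words, 3 dimensions; row 0 (UNK) is all zeros.
embedding_table = torch.tensor([[0.0, 0.0, 0.0],
                                [1.0, 2.0, 3.0],
                                [4.0, 5.0, 6.0],
                                [7.0, 8.0, 9.0]])

doc = [1, 3]                          # indices of the real tokens
padded_doc = doc + [UNK_IDX] * 2      # right-padded to length 4

# The zero UNK rows contribute nothing, so the sum over the padded document
# equals the sum over the unpadded one.
assert torch.equal(embedding_table[doc].sum(dim=0),
                   embedding_table[padded_doc].sum(dim=0))

# For an average, divide by the true document length (2), not the padded length.
avg = embedding_table[padded_doc].sum(dim=0) / len(doc)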
Example #2
    def test_denumberize_numberize(self):
        """ Tests that `denumberize` is left inverse of `numberize` """
        fixture1 = [["a", "b", "c"], ["d", "e", "f"], ["a", "f", "b"],
                    ["b", "e", "d"]]
        fixture2 = [[0, 1, 2], [3, 4, 5], [0, 5, 1], [2, 4, 3]]

        for fixture in (fixture1, fixture2):
            v = Vocab.from_docs(fixture)
            for doc in fixture:
                self.assertEqual(v.denumberize(v.numberize(doc)), doc)
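
For context, the property being tested is a round trip: numberizing a document and then denumberizing the result must reproduce the original tokens. The following is only a rough sketch of a class with that behaviour, under assumed semantics; the project's actual `Vocab.from_docs`, as used in Example #1, also accepts `default`, `start`, and `end` arguments, which are omitted here.

class Vocab:
    def __init__(self, names):
        self.names = list(names)                                   # index -> token
        self.index = {tok: i for i, tok in enumerate(self.names)}  # token -> index

    @classmethod
    def from_docs(cls, docs):
        # Assign an index to each distinct token, in order of first occurrence.
        names = []
        for doc in docs:
            for token in doc:
                if token not in names:
                    names.append(token)
        return cls(names)

    def numberize(self, doc):
        return [self.index[token] for token in doc]

    def denumberize(self, doc):
        return [self.names[i] for i in doc]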