Code example #1
    @classmethod
    def from_dataframe(cls, review_df, cutoff=25):
        """
        Instantiate the vectorizer from the dataset dataframe.

        Args:
            review_df (pandas.DataFrame): the dataset dataframe
        Returns:
            an instance of the ReviewVectorizer
        """
        review_vocab = Vocabulary(add_unk=True)
        rating_vocab = Vocabulary(add_unk=False)

        # Add ratings
        for rating in sorted(set(review_df.rating)):
            rating_vocab.add_token(rating)

        # Add words that occur more often than the provided cutoff
        word_counts = Counter()
        for review in review_df.review:
            for word in review.split(" "):
                if word not in string.punctuation:
                    word_counts[word] += 1

        for word, count in word_counts.items():
            if count > cutoff:
                review_vocab.add_token(word)

        return cls(review_vocab, rating_vocab)
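A minimal usage sketch for this classmethod, with hypothetical toy data; it assumes the surrounding ReviewVectorizer class and a pandas DataFrame whose review and rating columns match those read above:

import pandas as pd

# Hypothetical toy dataframe; column names match those used by from_dataframe.
review_df = pd.DataFrame({
    "review": ["great movie loved it", "terrible plot and bad acting"],
    "rating": ["positive", "negative"],
})

# A low cutoff keeps every word; with the default cutoff=25 only frequent words
# would enter review_vocab, while every distinct rating enters rating_vocab.
vectorizer = ReviewVectorizer.from_dataframe(review_df, cutoff=0)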
Code example #2
File: vectorizer.py  Project: rrajasek95/ebert
    def __init__(self,
                 vocabulary: Vocabulary,
                 tokenizer=split_tokenizer,
                 init_token=None,
                 eos_token=None,
                 pad_token=None,
                 reverse=False):
        self.vocab = vocabulary

        if init_token:
            self.init_idx = vocabulary.add_token(init_token)
            self.init_token = init_token
            self.init_present = 1
        else:
            self.init_present = 0

        if eos_token:
            self.eos_idx = vocabulary.add_token(eos_token)
            self.eos_token = eos_token
            self.eos_present = 1
        else:
            self.eos_present = 0

        if pad_token:
            self.pad_idx = vocabulary.add_token(pad_token)

        self.tokenizer = tokenizer
        self.reverse = reverse
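A sketch of how this constructor might be invoked. The class name Vectorizer, the token strings, and an argument-free Vocabulary() constructor are assumptions for illustration; Vocabulary and split_tokenizer come from the same project:

# Assumed: Vocabulary can be constructed empty in this project.
vocab = Vocabulary()

# Passing special tokens registers them in the vocabulary via add_token and
# records their indices; init_present / eos_present track whether init_token
# and eos_token were supplied.
vectorizer = Vectorizer(vocab,
                        tokenizer=split_tokenizer,
                        init_token="<s>",
                        eos_token="</s>",
                        pad_token="<pad>",
                        reverse=False)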
Code example #3
    # DATA FILES #
    train_loc = locations['train_loc']
    dev_loc = locations['test_loc']
    fasttext_loc = locations['embeddings_loc']
    w2vec_loc = locations['w2vec_loc']
    model_loc = locations['model_loc']
    stopwordsfile = locations['stopwordsfile']

    # VOCABULARY #
    special_tokens = [INIT_TOKEN, UNK_TOKEN, END_TOKEN, PAD_TOKEN]
    with open(train_loc) as f:
        raw_text = f.read()
    voc = Vocabulary(raw_text, bigram=bigram)
    voc.prune(threshold=1)
    for token in special_tokens:
        voc.add_token(token)
    w2idx = voc.w2idx
    idx2w = voc.idx2w
    voc_size = voc.get_length()
    pad_idx = w2idx[PAD_TOKEN]
    init_idx = w2idx[INIT_TOKEN]

    # STOP WORDS #
    with open(stopwordsfile) as f:
        stop_words = f.read().split()
    stop_words.extend(special_tokens)
    stop_idx = [w2idx[w] for w in stop_words if w in w2idx.keys()]

    # PRE-TRAINED EMBEDDINGS #
    if os.path.exists(w2vec_loc):
        with open(w2vec_loc, 'rb') as f: