# Normalize text
    #for df in train, test:
    #    df["comment_text"] = normalizeString(df["comment_text"])
    #stemmer = PorterStemmer()
    #def custom_tokenize(text):
    #    tokens = wordpunct_tokenize(text)
    #    tokens = [stemmer.stem(token) for token in tokens]
    #    return tokens
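    # normalizeString is referenced above but not shown in this snippet. A minimal
    # sketch of such a helper (hypothetical: simple lowercasing plus stripping of
    # non-alphanumeric characters) could look like:
    #
    # import re
    # def normalizeString(text):
    #     text = text.lower().strip()
    #     return re.sub(r"[^a-z0-9' ]+", " ", text)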

    # Tokenize comments
    tok = Tokenizer(max_features=MAX_FEATURES,
                    max_len=MAX_LEN,
                    tokenizer=wordpunct_tokenize)
    X = tok.fit_transform(
        pd.concat([
            train_preproc["comment_text"].fillna("na").astype(str),
            test_preproc["comment_text"].fillna("na").astype(str)
        ]))
    X_train = X[:len(train), :]
    X_test = X[len(train):, :]

    print(X_train.shape, X_test.shape)
    print("<+++++++>")
    print("Total words found by tokenizer in train and test are {}".format(
        len(tok.doc_freq)))
    print("Top 10 words in vocab are {}".format(tok.doc_freq.most_common(10)))
    print("Last 10 words to be used vocab with their freq are {}".format(
        tok.doc_freq.most_common(MAX_FEATURES)[-10:]))

    # Initialize embeddings
    embedding_matrix, oov_list = initialize_embeddings(EMBEDDING_FILE, tok)
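
    # initialize_embeddings is defined elsewhere in this project and is not shown
    # here. A minimal sketch of what such a helper might do (hypothetical; assumes
    # a GloVe-style text file of "word v1 v2 ..." lines, a fixed embed_dim, and
    # numpy imported as np) could be:
    #
    # def initialize_embeddings(embedding_file, tok, embed_dim=300):
    #     vectors = {}
    #     with open(embedding_file, encoding="utf8") as f:
    #         for line in f:
    #             word, *coefs = line.rstrip().split(" ")
    #             vectors[word] = np.asarray(coefs, dtype="float32")
    #     vocab = [w for w, _ in tok.doc_freq.most_common(MAX_FEATURES)]
    #     embedding_matrix = np.zeros((len(vocab) + 1, embed_dim))  # row 0 left for padding
    #     oov_list = []
    #     for i, word in enumerate(vocab, start=1):
    #         if word in vectors:
    #             embedding_matrix[i] = vectors[word]
    #         else:
    #             oov_list.append(word)
    #     return embedding_matrix, oov_list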
Example #2
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

if __name__ == "__main__":
    # Stop words come from the few dozen most frequent tokens identified by the Tokenizer;
    # all are grammatical function words carrying little semantic meaning
    stop_words_custom = [
        'a', 'and', 'the', 'is', 'am', 'are', 'he', 'she', 'it', 'to', 'an'
    ]

    #priors, training_documents, training_labels = generate_training_samples(sys.argv[1])
    priors, training_documents, training_labels = generate_training_samples(
        "op_spam_training_data/")

    # Build Tokenizer and turn training documents into integer tokens
    tok = Tokenizer(num_tokens=None, stop_words=stop_words_custom)
    tokenized_train = tok.fit_transform(training_documents)

    # Convert training samples and labels to numpy arrays
    X = list_to_numpy(tokenized_train, tok)
    y = np.asarray(training_labels)

    # Split off development data
    # Fixed random_state for reproducible DEBUG runs
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.25,
                                                        random_state=49)
    # Fit model on training data
    nb_clf = MultinomialNB(alpha=0.9)  # alpha is the Laplace/Lidstone smoothing parameter
    nb_clf.fit(X_train, y_train)
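
    # The original snippet stops after fitting. A hypothetical next step (not part
    # of the source) would be checking the model on the held-out split, assuming
    # nb_clf exposes a standard predict method as sklearn's MultinomialNB does:
    #
    # y_pred = nb_clf.predict(X_test)
    # print("Dev accuracy: {:.3f}".format(np.mean(y_pred == y_test)))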