Python HashingVectorizer.build_tokenizer Examples

Programming language: Python

Namespace/package name: sklearn.feature_extraction.text

Class/type: HashingVectorizer

Method/function: build_tokenizer

Examples at hotexamples.com: 2

Python HashingVectorizer.build_tokenizer - 2 examples found. These are the top-rated real-world Python examples of sklearn.feature_extraction.text.HashingVectorizer.build_tokenizer, extracted from open source projects.

Example #1

File: clustering2.py Project: rangeonnicolas/MOPA

    stop_words = get_stop_words('fr')
    stop_words.extend(ADDITIONAL_STOP_WORDS)

    # A HashingVectorizer splits a text into a list of words and returns a matrix in which each row corresponds to
    # a document and each column to a word (strictly, to a hash bucket that words are mapped into)
    # doc : http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
    hasher = HashingVectorizer(strip_accents='unicode',
                               stop_words=stop_words,
                               norm=None)

    # HashingVectorizer does not stem words during tokenization,
    # so we make it do so anyway.
    # To that end, we retrieve its tokenizer function, extend it, and inject it back:

    original_tokenizer = hasher.build_tokenizer()  # retrieve the tokenizer function
    stemmer = SnowballStemmer("french", ignore_stopwords=True)

    def new_tokenizer(text):
        # Tokenize with the vectorizer's own tokenizer, then stem each token.
        words = original_tokenizer(text)
        stemmed_words = [stemmer.stem(w) for w in words]
        return stemmed_words

    # Create a new hasher and inject our extended tokenizer into it.
    hasher = HashingVectorizer(
        tokenizer=new_tokenizer,
        strip_accents='unicode',
        stop_words=stop_words,
        norm=None)

    # A pipeline is just a list in which we place different processors.
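
The wrap-and-reinject pattern in this example can be tried without scikit-learn or NLTK installed. The sketch below is only an approximation: it assumes the callable returned by `build_tokenizer()` behaves like `findall` on scikit-learn's documented default `token_pattern`, and `toy_stem` is a hypothetical stand-in for `SnowballStemmer`, not a real French stemmer.

```python
import re

# Assumption: scikit-learn's documented default token_pattern, which keeps
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def base_tokenizer(text):
    """Stand-in for the callable returned by build_tokenizer()."""
    return TOKEN_PATTERN.findall(text)


def toy_stem(word):
    """Hypothetical stemmer: strips one trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def stemming_tokenizer(text):
    """Tokenize with the base tokenizer, then stem every token."""
    return [toy_stem(w) for w in base_tokenizer(text)]


tokens = stemming_tokenizer("les chats noirs dorment")
print(tokens)
```

A function shaped like `stemming_tokenizer` is what gets passed as `tokenizer=` when the second `HashingVectorizer` is built. Stop-word filtering runs after the tokenizer, so the stop-word list is compared against the stemmed tokens, which is worth keeping in mind when the stemmer changes words that appear in that list.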

Example #2

File: plot_out_of_core_classification.py Project: watereals/ShallowLearn

            yield doc


###############################################################################
# Main
# ----
#
# Create the vectorizer and limit the number of features to a reasonable
# maximum

N_FEATURES = 2 ** 18

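# alternate_sign=False keeps feature values non-negative; it replaces the
# non_negative=True option, which later scikit-learn releases removed.
vectorizer = HashingVectorizer(decode_error='ignore', n_features=N_FEATURES,
                               alternate_sign=False)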

tokenizer = vectorizer.build_tokenizer()
preprocessor = vectorizer.build_preprocessor()
stop_words = vectorizer.get_stop_words()


def tokenize(text):
    # decode -> preprocess -> tokenize -> drop stop words / build n-grams
    tokens = tokenizer(preprocessor(vectorizer.decode(text)))
    return vectorizer._word_ngrams(tokens, stop_words)


# Iterator over parsed Reuters SGML files.
data_stream = stream_reuters_documents()

# We learn a binary classification between the "acq" class and all the others.
# "acq" was chosen as it is more or less evenly distributed in the Reuters
# files. For other datasets, one should take care of creating a test set with
# a realistic portion of positive instances.
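
The `tokenize` helper in this example chains four vectorizer steps: decoding, preprocessing, tokenizing, and stop-word/n-gram handling. As a rough illustration of what each stage contributes, here is a standard-library sketch. The helper names are hypothetical, the token pattern is assumed to be scikit-learn's documented default, and the n-gram step only approximates the private `_word_ngrams` method.

```python
import re

# Assumption: scikit-learn's documented default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def decode(doc, encoding="utf-8", errors="ignore"):
    """Turn bytes into str, dropping undecodable bytes; pass str through."""
    return doc.decode(encoding, errors) if isinstance(doc, bytes) else doc


def preprocess(text):
    """Lowercase only; accent stripping is left out of this sketch."""
    return text.lower()


def word_ngrams(tokens, stop_words=None, ngram_range=(1, 1)):
    """Drop stop words, then join consecutive tokens into n-grams."""
    if stop_words is not None:
        tokens = [t for t in tokens if t not in stop_words]
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tokenize(doc, stop_words=None, ngram_range=(1, 1)):
    """Chain the four stages in the same order as the example's helper."""
    text = preprocess(decode(doc))
    return word_ngrams(TOKEN_PATTERN.findall(text), stop_words, ngram_range)


raw = b"Acquisition of XYZ Corp. by A\xff"  # trailing byte is invalid UTF-8
unigrams = tokenize(raw)
bigrams = tokenize(raw, stop_words={"of", "by"}, ngram_range=(1, 2))
print(unigrams)
print(bigrams)
```

With the default settings used in the example (no stop words, unigrams only), the last stage passes the tokens through unchanged, so the result is simply the lowercased tokens of two or more characters.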