import re

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer


class HashTfidfVectoriser:

    def __init__(self, n_features):
        self.hashing_vectoriser = HashingVectorizer(n_features=n_features,
                                                    alternate_sign=False)
        self.tfidf_transformer = TfidfTransformer()
        self.words_by_hashes_dict = {}
        self.last_data = None

    def words_by_hash(self, hash):
        # Look up which words were hashed to the given column index.
        return self.words_by_hashes_dict[hash]

    def fit_transform(self, data):
        # Keep a copy of the raw documents, then strip digits without
        # mutating the caller's list.
        self.last_data = data[:]
        data = [re.sub(r"\d+", "", doc) for doc in data]

        # Rebuild the hash -> words mapping so collisions can be inspected.
        # Each single-word "document" yields exactly one non-zero entry,
        # so the CSR `indices` array lines up with `unique_words`.
        self.words_by_hashes_dict = {}
        words_list = self.hashing_vectoriser.build_analyzer()("\n".join(data))
        unique_words = list(set(words_list))
        hashes = self.hashing_vectoriser.transform(unique_words).indices
        for w, h in zip(unique_words, hashes):
            self.words_by_hashes_dict.setdefault(h, []).append(w)

        # HashingVectorizer is stateless, so fit_transform == transform here;
        # only the TF-IDF transformer actually learns anything (the IDF).
        return self.tfidf_transformer.fit_transform(
            self.hashing_vectoriser.fit_transform(data))
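# A minimal usage sketch of the class above; the two-document toy corpus is
# made up for illustration. The hash -> words mapping lets us recover which
# words landed in a given TF-IDF column.
vectoriser = HashTfidfVectoriser(n_features=2 ** 10)
tfidf = vectoriser.fit_transform(['the cat sat on the mat', 'the dog sat'])
used_column = tfidf.indices[0]                  # a column that is in use
print(vectoriser.words_by_hash(used_column))    # words hashed to that column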
def cleaner_str(s):
    # Reuse the HashingVectorizer analyzer as a simple text cleaner: it
    # lowercases, tokenizes and drops English stop words, and we then
    # rejoin the surviving tokens into a single string.
    cleaner = HashingVectorizer(decode_error='ignore', analyzer='word',
                                ngram_range=(1, 1), stop_words='english')
    c = cleaner.build_analyzer()
    return " ".join(c(s))
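# A short, hypothetical usage example of cleaner_str (the input string is
# made up for illustration): stop words such as "the", "are" and "on" are
# dropped and the remaining tokens are lowercased.
print(cleaner_str("The cats are sitting on the mat."))
# expected output along the lines of: 'cats sitting mat'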
# This mapping is completely stateless and the dimensionality of the output space is explicitly fixed in advance (here we use a modulo `2 ** 20`, which means roughly 1M dimensions). This makes it possible to work around the limitations of the vocabulary-based vectorizers, both for parallelizability and for online / out-of-core learning.

# %% [markdown] {"deletable": true, "editable": true}
# The `HashingVectorizer` class is an alternative to the `CountVectorizer` (or the `TfidfVectorizer` class with `use_idf=False`) that internally uses the murmurhash hash function:

# %% {"deletable": true, "editable": true}
from sklearn.feature_extraction.text import HashingVectorizer

h_vectorizer = HashingVectorizer(encoding='latin-1')
h_vectorizer

# %% [markdown] {"deletable": true, "editable": true}
# It shares the same "preprocessor", "tokenizer" and "analyzer" infrastructure:

# %% {"deletable": true, "editable": true}
analyzer = h_vectorizer.build_analyzer()
analyzer('This is a test sentence.')

# %% [markdown] {"deletable": true, "editable": true}
# We can vectorize our datasets into a scipy sparse matrix exactly as we would have done with the `CountVectorizer` or `TfidfVectorizer`, except that we can directly call the `transform` method: there is no need to `fit`, as `HashingVectorizer` is a stateless transformer:

# %% {"deletable": true, "editable": true}
docs_train, y_train = train['data'], train['target']
docs_valid, y_valid = test['data'][:12500], test['target'][:12500]
docs_test, y_test = test['data'][12500:], test['target'][12500:]

# %% [markdown] {"deletable": true, "editable": true}
# The dimension of the output is fixed ahead of time to `n_features=2 ** 20` by default (nearly 1M features) to minimize the rate of collisions on most classification problems while keeping the linear models reasonably sized (1M weights in the `coef_` attribute):

# %% {"deletable": true, "editable": true}
h_vectorizer.transform(docs_train)
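# %% [markdown] {"deletable": true, "editable": true}
# As a quick sanity check of the statelessness claim (the toy corpus below is made up for illustration), two independently constructed `HashingVectorizer` instances map the same documents to exactly the same columns, without any `fit` step or shared vocabulary:

# %% {"deletable": true, "editable": true}
toy_docs = ['the cat sat on the mat', 'the dog sat on the log']
a = HashingVectorizer(encoding='latin-1').transform(toy_docs)
b = HashingVectorizer(encoding='latin-1').transform(toy_docs)
(a != b).nnz == 0  # True: the two sparse matrices are identical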