def __tokenize_sentence(self, processed):
    """
    Tokenize a preprocessed sentence and store the result on self.tokens.

    Removes sentence-level punctuation such as the comma (,) but keeps
    intra-word punctuation such as the dash (-) in e.g. 'morning-after'.
    Function only for internal usage.

    :param processed: a preprocessed sentence (output of the preprocessing step)
    """
    self.tokens = Preprocessor.get_processed_tokens(processed)
def get_idf_array(self):
    """
    Use an external corpus to get IDF scores for cluster centroid calculations.

    Builds a binary document/word incidence matrix over the chosen NLTK corpus
    (Brown by default, Reuters when self.args.corpus == 'R'), then computes
    idf = log10(n / (docs_containing_word + 1)) per word id.

    :return: numpy array of idf values, indexed by WordMap word id
             (also stored on self.idf_array)
    """
    corpus = brown
    if self.args.corpus == 'R':
        corpus = reuters
    num_words = Vectors().num_unique_words
    doc_ids = corpus.fileids()
    n = len(doc_ids)  # number of documents in corpus
    docs_word_matrix = np.zeros([n, num_words])
    for doc_idx, doc_id in enumerate(doc_ids):
        words_in_doc = set()
        # iterate the corpus reader lazily; no need to materialize a list
        for s in corpus.sents(doc_id):
            s = ' '.join(s)
            proc_s = Preprocessor.get_processed_tokens(Preprocessor.get_processed_sentence(s))
            if proc_s:
                words_in_doc.update(proc_s)
        for word in words_in_doc:
            word_idx = WordMap.id_of(word)
            # compare against None, not truthiness: a word mapped to id 0 is
            # valid and must not be skipped (presumably id_of returns None
            # for unknown words -- TODO confirm against WordMap)
            if word_idx is not None:
                docs_word_matrix[doc_idx, word_idx] = 1
    docs_per_word = np.sum(docs_word_matrix, axis=0)
    # add one to the denominator to avoid divide-by-zero for unseen words
    self.idf_array = np.log10(np.divide(n, docs_per_word + 1))
    return self.idf_array