def get_idf_array(self):
    """
    Use an external corpus to get IDF scores for cluster centroid calculations
    :return: numpy array of idf values
    """
    corpus = brown
    if self.args.corpus == 'R':
        corpus = reuters
    num_words = Vectors().num_unique_words
    n = len(corpus.fileids())  # number of documents in corpus
    docs_word_matrix = np.zeros([n, num_words])
    for doc_idx, doc_id in enumerate(corpus.fileids()):
        sentences = list(corpus.sents(doc_id))
        words_in_doc = set()
        for s in sentences:
            s = ' '.join(s)
            proc_s = Preprocessor.get_processed_tokens(Preprocessor.get_processed_sentence(s))
            if proc_s:
                words_in_doc = words_in_doc.union(proc_s)
        for word in words_in_doc:
            word_idx = WordMap.id_of(word)
            if word_idx is not None:  # compare to None: 0 is a valid word id and would be skipped by a bare truthiness test
                docs_word_matrix[doc_idx, word_idx] = 1
    docs_per_word = np.sum(docs_word_matrix, axis=0)
    self.idf_array = np.log10(np.divide(n, docs_per_word + 1))  # add one to avoid divide-by-zero error
    return self.idf_array
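
# --- Illustrative sketch (not part of the class above). It reproduces the
# smoothed-IDF step of get_idf_array on a tiny hand-built binary
# document-word matrix; toy_matrix and n_docs are hypothetical names used
# only for this demonstration.
import numpy as np

toy_matrix = np.array([[1, 1, 0],   # doc 0 contains words 0 and 1
                       [1, 0, 0],   # doc 1 contains word 0 only
                       [1, 0, 1]])  # doc 2 contains words 0 and 2
n_docs = toy_matrix.shape[0]
docs_per_word = toy_matrix.sum(axis=0)        # -> [3, 1, 1]
idf = np.log10(n_docs / (docs_per_word + 1))  # same +1 smoothing as above
print(idf)  # word 0 occurs in every doc, so it gets the lowest (here negative) idf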
def __init__(self, raw_sentence, sent_pos, doc_id=None):
    """
    Initialize the Sentence class with the plain/raw and tokenized versions of
    the sentence, its position in the document, and the document id
    :param raw_sentence: the original sentence text
    :param sent_pos: position of the sentence within its document
    :param doc_id: id of the document the sentence belongs to
    """
    self.raw_sentence = ' '.join(raw_sentence.rstrip().split())  # collapse runs of whitespace
    self.raw_sentence = Preprocessor.strip_beginning(self.raw_sentence)
    self.tokens = []
    self.processed = Preprocessor.get_processed_sentence(self.raw_sentence)
    self.__tokenize_sentence(self.processed)
    self.sent_pos = int(sent_pos)  # position of sentence in document
    self.doc_id = doc_id
    self.vector = []  # placeholder
    self.order_by = self.sent_pos
    # score attributes, filled in later by the scoring stages
    self.c_score = self.p_score = self.f_score = None
    self.mead_score = self.lda_scores = self.melda_scores = None
    self.compressed = self.raw_sentence
    # update global mapping of words to indices
    WordMap.add_words(self.tokens)  # make sure self.tokens is the right thing here
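
# --- Illustrative usage sketch, assuming the surrounding Sentence class and
# its Preprocessor/WordMap dependencies are importable; the sentence text and
# doc_id below are made-up examples, not values from this repo.
# sent = Sentence("  The   quick brown fox jumped.  ", sent_pos=0, doc_id="doc-01")
# sent.raw_sentence  -> "The quick brown fox jumped."  (whitespace collapsed)
# sent.sent_pos      -> 0
# sent.order_by      -> 0   (defaults to the document position)
# WordMap now also contains this sentence's processed tokens.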