def parse(self, url, file_type, file_content):
    """Parse a fetched file, persist its cleaned text, and register it.

    For HTML input the markup is stripped (scripts/styles removed) and the
    page title captured; for any other type the raw content is used as-is.
    The cleaned text is written to ``Doc#<id>.txt``, wrapped in a
    ``Document``, and run through filter/stem/collection. Exact duplicates
    of an already-parsed document are discarded.

    Reference for HTML cleaning approach:
    https://stackoverflow.com/questions/30565404/remove-all-style-scripts-and-html-tags-from-an-html-page

    :param url: source URL of the file.
    :param file_type: MIME/type string; HTML handling triggers when it
        contains the substring ``'html'``.
    :param file_content: raw file content (markup or plain text).
    :return: ``True`` if the document was added to ``self.docs``,
        ``False`` if it was rejected as an exact duplicate.
    """
    text = file_content
    title = ''
    if 'html' in file_type:
        # Clean the file: don't keep HTML markup.
        soup = BeautifulSoup(file_content, 'html.parser')
        # Remove all javascript and stylesheet code.
        for script in soup(["script", "style"]):
            script.extract()
        # BUGFIX: pages without a <title> tag previously raised
        # AttributeError here; fall back to the empty-string default.
        if soup.title is not None and soup.title.string is not None:
            title = soup.title.string
        # BUGFIX: documents without a <body> (fragments, malformed HTML)
        # previously raised AttributeError; fall back to the whole tree.
        body = soup.body
        text = body.get_text() if body is not None else soup.get_text()

    # Normalize whitespace: strip each line, split on single spaces,
    # and drop blank chunks, yielding one token per output line.
    lines = (line.strip() for line in text.splitlines())
    chunks = []
    for line in lines:
        for phrase in line.split(" "):  # split on literal space only
            chunks.append(phrase.strip())
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # Persist the cleaned text; ids are assigned only to parsed documents.
    self.doc_id += 1
    filename = "Doc#" + str(self.doc_id) + '.txt'
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(text)

    document = Document(url, self.doc_id, filename, file_type,
                        self.stop_words)
    document.filter()
    document.stem()
    document.collection()
    if 'html' in file_type:
        document.set_title(title)

    # Exact-duplicate detection: a duplicate is remembered (so its URL
    # is not revisited) but not added to the parsed-document list.
    for d in self.docs:
        if self.duplicate_detection(d, document) == 1:
            self.url_already_seen = self.url_already_seen.union(
                {str(document.get_url())})
            return False

    self.docs.append(document)
    return True