Example #1
    def parse_urls(self, html):
        """
        Produces a list of URLs present in the given html.

        :type html: str
        :rtype:     list
        """

        soup = BeautifulSoup(html, "html.parser")
        urls = []

        # section headings, which (presumably) appear only on the main page
        for element in soup.find_all("h2", {"class": "section-heading"}):
            if element.a:
                url = element.a.get("href")
                if url not in self.visited_urls:
                    urls.append(Utility.clean_url(url))

        # story links, in the main page and also appearing as relevant articles
        for element in soup.find_all("a", {"class": "story-link"}):
            url = element.get("href")
            if url not in self.visited_urls:
                urls.append(Utility.clean_url(url))

        return urls
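
For context, this variant returns the new URLs so the caller owns the queue. Below is a minimal, hypothetical driver sketch (the loop, the de-duplication against `visited_urls`, and the `requests`-based fetch are assumptions added for illustration, not part of the original code):

    def crawl(self):
        # Hypothetical driver for the list-returning variant; assumes
        # `import requests` at module level and that __init__ sets up
        # self.url_queue (list) and self.visited_urls (set).
        while self.url_queue:
            url = self.url_queue.pop(0)
            if url in self.visited_urls:
                continue
            self.visited_urls.add(url)
            html = requests.get(url).text
            # extend the queue with whatever parse_urls() found
            self.url_queue.extend(self.parse_urls(html))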
Example #2
    def parse_urls(self, html):
        """
        Appends new URLs present in the given html to the URL queue.

        :type html: str
        """

        soup = BeautifulSoup(html, "html.parser")

        # section headings, which (presumably) appear only on the main page
        for element in soup.find_all("h2", {"class": "section-heading"}):
            if element.a:
                url = element.a.get("href")
                if url not in self.visited_urls:
                    self.url_queue.append(Utility.clean_url(url))

        # story links, in the main page and also appearing as relevant articles
        for element in soup.find_all("a", {"class": "story-link"}):
            url = element.get("href")
            if url not in self.visited_urls:
                self.url_queue.append(Utility.clean_url(url))
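
This variant appends the cleaned URLs straight onto `self.url_queue` as a side effect, so the corresponding driver no longer needs to collect a return value. Again a hypothetical sketch, with the same assumed `requests` fetch and `__init__` attributes as above:

    def crawl(self):
        # Hypothetical driver for the queue-mutating variant; parse_urls()
        # grows self.url_queue in place, so there is nothing to collect here.
        while self.url_queue:
            url = self.url_queue.pop(0)
            if url in self.visited_urls:
                continue
            self.visited_urls.add(url)
            html = requests.get(url).text
            self.parse_urls(html)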