def build_crawl_list(self): """ Build a list of all of the URLs based on the depth specified. """ current_depth = 1 page = requests.get(self.base_url).text if self.children > 0: self.urls = Page.get_urls(page)[:self.children] else: self.urls = Page.get_urls(page) # Below list holds previously scanned URLs, to stop URLs being added twice scanned_urls = [] while current_depth <= self.depth: # Append the links for each page then search it for more print 'Starting crawl depth', current_depth, 'with', len(self.urls), 'URLs to scan' new_urls = [] for url in self.urls: # If the url is not already scanned, and if it is not an image, xml etc. scan it. if url not in scanned_urls: if TasteDotCom.is_wanted_object(url): print 'Looking for child URLs in ', url markup = requests.get(url).text scanned_urls.append(url) if self.children > 0: new_urls = Page.get_urls(markup)[:self.children] else: new_urls = Page.get_urls(markup) print 'Found', len(new_urls), 'new pages' # for url in new_urls: # check_and_add(url) self.urls += new_urls current_depth += 1 print 'Finished crawling', self.base_url, 'found', len(self.urls), 'total URLs' # def run(self): # """ # Start Crawling the page specified # """ # #todo Make use of this method # print "Starting crawl session for", self.base_url # page = requests.get(self.base_url).text # child_urls = Page.get_urls(page) # for url in child_urls: # self.check_and_add(url) # def check_and_add(url): # pass
def test_get_urls(self):
    """Page.get_urls must extract all 387 hyperlinks from a known static page,
    each matching a basic http URL pattern."""
    # Use a known static page for testing; `with` guarantees the file handle
    # is closed (the original used the bare `file()` builtin and leaked it).
    with open("../miscellany/sausage_and_punpkin_curry.html") as f:
        html = f.read()
    urls = Page.get_urls(html)
    self.assertEqual(len(urls), 387)
    # BUG FIX: the original character class [//\a\w\.\+] contained a stray
    # '\a' (BEL control character) and a duplicated '/'. Dropping both leaves
    # the effective match set for real URLs unchanged.
    pattern = re.compile(r"http://www\.[/\w.+]+")
    for url in urls:
        # Check each URL matches a hyperlink pattern.
        self.assertTrue(pattern.match(url))