Example #1
    def test_get_assets(self):
        """
        Tests the get_assets method of HtmlParser
        """
        file_util = FileUtil()
        expected_assets = file_util.get_file_contents("assets_test_data.txt")
        html_parser = HtmlParser()
        urls = file_util.get_file_contents("same_hostname_urls_test_data.txt")
        actual_assets = html_parser.get_assets(urls)
        self.assertEqual(expected_assets, actual_assets)
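
This method is presumably defined inside a unittest.TestCase subclass. A minimal harness for running it might look like the sketch below; the class name HtmlParserTest and the module paths are assumptions for illustration, not part of the original project.

import unittest

# Assumed module paths; adjust to match the actual project layout.
from file_util import FileUtil
from html_parser import HtmlParser

class HtmlParserTest(unittest.TestCase):
    # ... test_get_assets from Example #1 goes here ...
    pass

if __name__ == "__main__":
    unittest.main()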
Example #2
import json

# UrlUtil, HtmlRequester and HtmlParser are assumed to be defined or
# imported elsewhere in the project.
class WebCrawler:
    def __init__(self):
        self.url_util = UrlUtil()
        self.html_requester = HtmlRequester()
        self.html_parser = HtmlParser()

    def crawl(self, url):
        """
        Returns the URLs reachable from the parameter URL
        The assets of each URL are also returned.
        Only URLs with the same hostname including subdomain as the parameter URL are returned.
        """

        url = self.url_util.normalise_url(url)
        hostname = self.url_util.get_hostname(url)

        urls_to_visit = [url]
        urls_visited = set()
        output = []
        # Each iteration of this loop processes the next URL to visit,
        # in FIFO (breadth-first) order.
        while urls_to_visit:
            url = urls_to_visit.pop(0)
            urls_visited.add(url)

            html = self.html_requester.get_html(url)
            links = self.html_parser.get_links(html)
            same_hostname_urls = self.html_parser.get_same_hostname_urls(
                hostname, links)
            assets = self.html_parser.get_assets(same_hostname_urls)
            web_pages = self.html_parser.get_web_pages(same_hostname_urls)

            output.append({"url": url, "assets": assets})
            print(json.dumps({"url": url, "assets": assets}, indent=4))

            for web_page in web_pages:
                # Do not visit a page more than once.
                if web_page not in urls_to_visit and web_page not in urls_visited:
                    urls_to_visit.append(web_page)

        return json.dumps(output, indent=4).splitlines()
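
A quick usage sketch, assuming the WebCrawler class and its helper classes are importable and the target site is reachable; the URL below is a placeholder:

# Hypothetical usage; "https://example.com" is a placeholder for a real site.
crawler = WebCrawler()
for line in crawler.crawl("https://example.com"):
    print(line)

Each element of the returned list is one line of the pretty-printed JSON report. The crawl proceeds breadth-first because newly discovered pages are appended to the end of urls_to_visit while the next page to process is taken from the front.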