Python HtmlParser.get_assets Examples

Programming Language: Python

Namespace/Package Name: html_parser

Class/Type: HtmlParser

Method/Function: get_assets

Examples at hotexamples.com: 2

Python HtmlParser.get_assets - 2 examples found. These are the top rated real world Python examples of html_parser.HtmlParser.get_assets extracted from open source projects. You can rate examples to help us improve the quality of examples.

Frequently Used Methods

Show Hide

HtmlParser(30)

city_parser(3)

county_parser(3)

feed(2)

get_assets(2)

close(1)

contextParer(1)

extract_url(1)

get_answer_count(1)

get_article_count(1)

get_ask_question_count(1)

get_brief_info(1)

get_collection_count(1)

get_education(1)

write_to_file(1)

Example #1

Show file

 def test_get_assets(self):
     """
     Tests get_file_contents method
     """
     file_util = FileUtil()
     expected_assets = file_util.get_file_contents("assets_test_data.txt")
     html_parser = HtmlParser()
     urls = file_util.get_file_contents("same_hostname_urls_test_data.txt")
     actual_assets = html_parser.get_assets(urls)
     self.assertEqual(expected_assets, actual_assets)

Example #2

Show file

File: web_crawler.py Project: kitlawes/web-crawler

class WebCrawler:
    def __init__(self):
        self.url_util = UrlUtil()
        self.html_requester = HtmlRequester()
        self.html_parser = HtmlParser()

    def crawl(self, url):
        """
        Returns the URLs reachable from the parameter URL
        The assets of each URL are also returned.
        Only URLs with the same hostname including subdomain as the parameter URL are returned.
        """

        url = self.url_util.normalise_url(url)
        hostname = self.url_util.get_hostname(url)

        urlsToVisit = [url]
        urlsVisted = []
        output = []
        # Each iteration of this loop processes the next URL to visit.
        while (len(urlsToVisit) > 0):

            url = urlsToVisit.pop(0)
            urlsVisted.append(url)

            html = self.html_requester.get_html(url)
            links = self.html_parser.get_links(html)
            same_hostname_urls = self.html_parser.get_same_hostname_urls(
                hostname, links)
            assets = self.html_parser.get_assets(same_hostname_urls)
            web_pages = self.html_parser.get_web_pages(same_hostname_urls)

            output.append({"url": url, "assets": assets})
            print json.dumps({"url": url, "assets": assets}, indent=4)

            for web_page in web_pages:
                # Do not visit a page more than once
                if not web_page in urlsToVisit and web_page not in urlsVisted:
                    urlsToVisit.append(web_page)

        return json.dumps(output, indent=4).splitlines()