Python extract_links示例

编程语言: Python

命名空间/包名称: crawler.utils

方法/功能: extract_links

hotexamples.com的示例: 5

Python extract_links - 已找到5个示例。这些是从开源项目中提取的最受好评的crawler.utils.extract_links现实Python示例。您可以评价示例，以帮助我们提高示例质量。

示例#1

显示文件

文件： test_extract_links.py 项目： citizen-stig/crawler-demo

def test_excludes_non_anchor():
    page = """
       <a href="http://example.com/test">click me</a>
       <a href="#about-us">about</a>
       <a onclick="something() href="">check</a>
       """
    urls = extract_links(page)
    assert len(urls) == 1
    assert "http://example.com/test2" in urls

示例#2

显示文件

文件： test_extract_links.py 项目： citizen-stig/crawler-demo

def test_extract_urls_from_href():
    page = """
    <a href="http://example.com/test">click me</a>
    <a href="/about-us">about</a>
    """

    urls = extract_links(page)
    assert len(urls) == 2
    assert "http://example.com/test" in urls
    assert "/about-us" in urls

示例#3

显示文件

文件： test_extract_links.py 项目： citizen-stig/crawler-demo

def test_excludes_non_web():
    page = """
        <a href="http://example.com/test">click me</a>
        <a href="mailto:[email protected]">mail us</a>
        <a href="skype:someone>call us</a>
        <a href="tel:+4733378901>call us here to</a>
        <a href="javascript:alert('Hello World!');">click</a>
    """

    urls = extract_links(page)
    assert len(urls) == 1
    assert "http://example.com/test2" in urls

示例#4

显示文件

文件： test_extract_links.py 项目： citizen-stig/crawler-demo

def test_extract_urls_from_plain_text():
    page = """
        <p>Visit http://example.com/test2 to know more or https://ya.ru/search</p>
        <p>Also http://your.shop/cart for more!</p>
        <p> thisisnothttp://ya.ru/even it look like one</p>
        <p> but amazon.com is shurele a link</a>
        <a href="/about-us">about</a>
        """

    urls = extract_links(page)
    assert len(urls) == 4
    assert "http://example.com/test2" in urls
    assert "/about-us" in urls
    assert "https://ya.ru/search" in urls
    assert "http://your.shop/cart" in urls

示例#5

显示文件

文件： executor.py 项目： citizen-stig/crawler-demo

    def _lookup_links(self, page: str):
        """Extracts all URLs from the page.

        Checks if they should be visited and updates self._to_visit"""
        links = utils.extract_links(page)
        for link in links:
            if link.startswith('/'):
                link = self.start_url.scheme \
                       + "://" \
                       + self.start_url.netloc \
                       + link
            parsed_link = urlparse(link)
            if parsed_link.netloc != self.start_url.netloc:
                continue
            if not parsed_link.path.startswith(self.start_url.path):
                continue
            if link in self._visited_urls:
                continue
            self._to_visit.add(link)