from urllib.robotparser import RobotFileParser

def get_robotstxt_parser(url, session=None):
    """Get a RobotFileParser for the given robots.txt URL."""
    rp = RobotFileParser()
    try:
        # urlopen and MaxContentBytes are project-level helpers: the call
        # fetches the URL with a size cap and never raises on HTTP errors
        # (allow_errors=range(600)), only on transport failures.
        req = urlopen(url, session, max_content_bytes=MaxContentBytes,
                      allow_errors=range(600))
    except Exception:
        # connect or timeout errors are treated as an absent robots.txt
        rp.allow_all = True
    else:
        if req.status_code >= 400:
            rp.allow_all = True
        elif req.status_code == 200:
            rp.parse(req.text.splitlines())
    return rp
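# A self-contained sketch of the standard-library RobotFileParser calls the
# function above relies on; the user agent and rules here are invented for
# illustration.
from urllib.robotparser import RobotFileParser

def _demo_parse_and_check():
    rp = RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /private/",
    ])
    # can_fetch() consults the allow_all/disallow_all flags first, then the
    # parsed rules, which is why the snippets here can set the flags directly.
    print(rp.can_fetch("ExampleBot", "https://example.com/index.html"))  # True
    print(rp.can_fetch("ExampleBot", "https://example.com/private/x"))   # False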
from urllib.robotparser import RobotFileParser

def get_robots(self):
    # robots_content is expected to be an iterable of lines (e.g. the result
    # of text.splitlines()); RobotFileParser.parse() iterates line by line,
    # so passing a raw string here would be consumed character by character.
    rp = RobotFileParser()
    if self.robots_content:
        rp.parse(self.robots_content)
    else:
        # no stored robots.txt: treat everything as allowed
        rp.allow_all = True
    return rp
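# Hedged usage sketch: the enclosing class is not shown, so this binds the
# method above to a hypothetical stand-in that stores robots.txt content as
# a list of lines.
class _RobotsHolder:
    get_robots = get_robots  # reuse the function above as a method

    def __init__(self, robots_content):
        self.robots_content = robots_content

holder = _RobotsHolder("User-agent: *\nDisallow: /admin/".splitlines())
print(holder.get_robots().can_fetch("*", "https://example.com/admin/x"))  # False
print(holder.get_robots().can_fetch("*", "https://example.com/"))         # True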
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

async def parse_robots(session, base):
    """Fetch and parse the robots.txt file for a given base URL.

    Returns an instance of RobotFileParser.
    """
    # robots.txt always lives at the site root, so join with an absolute
    # path; urljoin(base, "robots.txt") would resolve relative to the base
    # URL's own path instead.
    url = urljoin(base, "/robots.txt")
    async with session.get(url) as response:
        status = response.status
        text = await response.text()
    robot_parser = RobotFileParser()
    if status == 200:
        robot_parser.parse(text.splitlines())
    else:
        # missing or errored robots.txt: allow everything
        robot_parser.allow_all = True
    return robot_parser
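# Hedged usage sketch assuming aiohttp: the session.get(...) context manager,
# response.status, and await response.text() used above all match
# aiohttp.ClientSession's API. The target host is illustrative only.
import asyncio
import aiohttp

async def _demo_parse_robots():
    async with aiohttp.ClientSession() as session:
        rp = await parse_robots(session, "https://www.python.org/")
        print(rp.can_fetch("ExampleBot", "https://www.python.org/"))

# asyncio.run(_demo_parse_robots())  # uncomment to perform a real request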
from urllib.robotparser import RobotFileParser

def get_robots_parser(self, url: str):
    # self.store, download_page, and the ALLOW_ALL/DISALLOW_ALL sentinels are
    # defined elsewhere in this project; robots.txt bodies are handled as bytes.
    rp = RobotFileParser()
    if self.store.exists(url, 'txt'):
        # reuse a previously cached robots.txt body
        body = self.store.load_url(url, 'txt')
    else:
        page, status_code = download_page(url, 'Robot')
        body = page.body
        # Mirror the stdlib convention: 401/403 mean the whole site is
        # off-limits, any other 4xx (including 404) means no restrictions.
        if status_code in [401, 403]:
            body = self.DISALLOW_ALL
        elif 400 <= status_code < 500:
            body = self.ALLOW_ALL
        self.store.save_url(url, body, 'txt')
    if body.strip() == self.ALLOW_ALL:
        rp.allow_all = True
    elif body.strip() == self.DISALLOW_ALL:
        rp.disallow_all = True
    else:
        rp.parse(body.decode('utf-8').splitlines())
    return rp
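# A self-contained sketch of the two sentinel branches above: flipping the
# parser's internal allow_all/disallow_all flags short-circuits can_fetch()
# before any parsed rules are consulted.
from urllib.robotparser import RobotFileParser

rp_open = RobotFileParser()
rp_open.allow_all = True           # what the ALLOW_ALL cache marker restores
print(rp_open.can_fetch("*", "https://example.com/anything"))    # True

rp_closed = RobotFileParser()
rp_closed.disallow_all = True      # what the DISALLOW_ALL marker restores
print(rp_closed.can_fetch("*", "https://example.com/anything"))  # False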