import logging

# Crawler and LinkConstraint are assumed to be provided by the surrounding
# pyoogle crawling package; their exact import paths are not shown in this excerpt.


def crawl_mathy():
    # Build a constraint that describes which outgoing WebNode links to follow
    constraint = LinkConstraint('http', 'www.math.kit.edu')

    # Prevent downloading links with these endings.
    # Frequent candidates: '.png', '.jpg', '.jpeg', '.pdf', '.ico', '.doc', '.txt', '.gz', '.zip', '.tar', '.ps',
    # '.docx', '.tex', '.gif', '.ppt', '.m', '.mw', '.mp3', '.wav', '.mp4'
    forbidden_endings = ['.pdf', '.png', '.ico', '#top']  # for fast exclusion
    constraint.add_rule(
        lambda link: not any(link.lower().endswith(ending) for ending in forbidden_endings))

    # Forbid any dot in the last path segment, as such a link most likely points
    # to a file we are not interested in
    def rule_no_point_in_last_path_segment(link_parsed):
        split = link_parsed.path.split("/")
        return len(split) == 0 or "." not in split[-1]

    constraint.add_rule(rule_no_point_in_last_path_segment, parsed_link=True)

    # Start the crawler from a start domain, optionally loading already existing nodes
    from pyoogle.config import DATABASE_PATH
    path = DATABASE_PATH
    c = Crawler(path, constraint)
    c.start("http://www.math.kit.edu", clear_store=False)

    # Wait for the crawler to finish
    c.join()
    webnet = c.web_net
    logging.info("DONE, webnet contains %d nodes", len(webnet))
    return path, webnet
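
# Hedged sketch (not part of the original module): a stand-alone illustration of
# how the parsed-link rule in crawl_mathy classifies URLs. It assumes that rules
# registered with parsed_link=True receive a urllib.parse.ParseResult, which is
# consistent with the .path attribute used above. The example URLs are made up.
def _demo_no_point_in_last_path_segment():
    from urllib.parse import urlparse

    examples = [
        "http://www.math.kit.edu/lehre/seite/vorlesungen",  # no dot in last segment -> follow
        "http://www.math.kit.edu/media/script.pdf",         # dot suggests a file -> skip
    ]
    for url in examples:
        last_segment = urlparse(url).path.split("/")[-1]
        verdict = "follow" if "." not in last_segment else "skip"
        print(url, "->", verdict)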
def crawl_spon():
    # Build the link constraint for Spiegel Online; the empty scheme presumably
    # allows any scheme (http and https)
    constraint = LinkConstraint('', 'www.spiegel.de')

    # Forbid any dot in the last path segment unless it is an .html/.htm page,
    # as other links most likely point to files we are not interested in
    def rule_no_point_in_last_path_segment(link_parsed):
        split = link_parsed.path.split("/")
        return (len(split) == 0
                or "." not in split[-1]
                or split[-1].lower().endswith((".html", ".htm")))

    constraint.add_rule(rule_no_point_in_last_path_segment, parsed_link=True)

    path = "/home/daniel/PycharmProjects/PageRank/spon.db"
    c = Crawler(path, constraint)
    c.start("http://www.spiegel.de", clear_store=False)

    # Wait for the crawler to finish
    c.join()
    webnet = c.web_net
    logging.info("DONE, webnet contains %d nodes", len(webnet))
    return path, webnet
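
# Hedged usage sketch: a minimal entry point that runs one of the crawls above.
# The logging setup and the choice of crawl_mathy are assumptions for
# illustration; both crawl functions return the database path and the
# resulting web net, as seen in their return statements.
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    db_path, net = crawl_mathy()
    logging.info("Crawl finished, database at %s, %d nodes", db_path, len(net))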