Example #1
def test_robot_file_url():
    scraper = SimpleWebScraper()
    url = "https://en.wikipedia.org/wiki/Tomato"
    assert scraper.get_robot_url(url) == "https://en.wikipedia.org/robots.txt"

    url = "https://scikit-learn.org/stable/documentation.html"
    assert scraper.get_robot_url(url) == "https://scikit-learn.org/robots.txt"
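This test only pins down the mapping from any page URL to its site-root robots.txt. A minimal sketch that would satisfy it, assuming the standard-library urlsplit is acceptable:

from urllib.parse import urlsplit

class SimpleWebScraper:
    def get_robot_url(self, url):
        # Keep only the scheme and host; by convention robots.txt
        # always lives at the root of the site.
        parts = urlsplit(url)
        return f"{parts.scheme}://{parts.netloc}/robots.txt"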
Example #2
def test_web_scraper_fetch():
    scraper = SimpleWebScraper()
    headers, body = scraper.fetch("https://fr.wikipedia.org/wiki/Tomate")
    assert isinstance(headers, dict)
    assert isinstance(body, bytes)
    assert headers['Content-Type'] == "text/html; charset=UTF-8"

    article = WikipediaArticle(body, encoding="utf-8")
    expected_link = "https://en.wikipedia.org/wiki/Tomato"
    assert expected_link in article.get_language_links()

    main_text = article.get_main_text()
    assert main_text.startswith("Solanum lycopersicum\n\nLe plant de tomates")
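The test constrains fetch to return the response headers as a plain dict and the body as raw, undecoded bytes. One possible SimpleWebScraper method, assuming the third-party requests library is used (the test does not mandate any particular HTTP client):

import requests

def fetch(self, url):
    # Download the page; surface headers as a plain dict and the
    # body as bytes, leaving decoding to callers such as
    # WikipediaArticle.
    response = requests.get(url)
    response.raise_for_status()
    return dict(response.headers), response.content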
Example #3
import json

def test_web_scraper_fetch_and_save(tmpdir):
    scraper = SimpleWebScraper(output_folder=tmpdir)
    result_folder = scraper.fetch_and_save(
        "https://fr.wikipedia.org/wiki/Pomme_de_terre")

    assert result_folder == (
        tmpdir / "fr.wikipedia.org" / "wiki" / "Pomme_de_terre")
    with open(result_folder / "headers.json") as f:
        headers = json.load(f)
    assert headers['Content-Type'] == "text/html; charset=UTF-8"

    body = (result_folder / "body").read_bytes()
    article = WikipediaArticle(body)
    expected_link = "https://en.wikipedia.org/wiki/Potato"
    assert expected_link in article.get_language_links()
    assert article.get_main_text().startswith(
        "Solanum tuberosum\n\nLa pomme de terre, ou patate[1]")
Example #4
def test_web_scraper_robots_file_handling():
    scraper = SimpleWebScraper()
    assert scraper.can_fetch("https://en.wikipedia.org/wiki/Tomato")
    assert not scraper.can_fetch("https://en.wikipedia.org/api/")
    assert not scraper.can_fetch("https://en.wikipedia.org/wiki/Special:")
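can_fetch must honour the rules published in the site's robots.txt. The standard library already parses that format, so a sketch can delegate to urllib.robotparser; re-downloading robots.txt on every call is wasteful, and a real implementation would cache one parser per host:

from urllib.robotparser import RobotFileParser

def can_fetch(self, url):
    # Fetch and parse robots.txt for this URL's host, then ask
    # whether a generic ("*") user agent may fetch the URL.
    parser = RobotFileParser(self.get_robot_url(url))
    parser.read()
    return parser.can_fetch("*", url)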