import helpers

# each time you crawl, clear out your database and start over:
helpers.create_or_replace_table()

# add the first link to crawl:
urls = ['https://www.mccormick.northwestern.edu/eecs/courses/']
visited = {}
counter = 0

while len(urls) > 0:
    # get the next url
    url = urls.pop(0)
    soup = helpers.get_webpage(url)
    counter += 1

    # extract urls from the web page (already done for you)
    webpage_urls = helpers.extract_links_from_webpage(soup, url)

    # extract key data from the web page (already done for you):
    # (print the row variable to understand it)
    row = helpers.extract_data_from_webpage(soup, url)
    print(row['body'])
    urls += webpage_urls

    # YOUR TASKS:
    # 1. Add the urls that you found to the urls list so that the
    #    webpage keeps crawling (b/c of the while loop condition),
    #    just like Tutorial 7.
    print('add webpage_urls to the urls list')

    # 2. Track how many times each url has been visited as you crawl,
    #    and don't crawl the same page twice.
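# ---------------------------------------------------------------------------
# A minimal sketch of one way tasks 1 and 2 above could be handled, assuming
# the helpers module behaves as in the starter code (get_webpage,
# extract_links_from_webpage, extract_data_from_webpage,
# create_or_replace_table). This is an illustration, not the official
# solution.
# ---------------------------------------------------------------------------
import helpers

helpers.create_or_replace_table()

urls = ['https://www.mccormick.northwestern.edu/eecs/courses/']
visited = {}   # maps url -> number of times it has been pulled off the queue
counter = 0

while len(urls) > 0:
    url = urls.pop(0)

    # task 2: count every time this url comes up, and skip it if it has
    # already been crawled once.
    visited[url] = visited.get(url, 0) + 1
    if visited[url] > 1:
        continue

    soup = helpers.get_webpage(url)
    counter += 1

    webpage_urls = helpers.extract_links_from_webpage(soup, url)
    row = helpers.extract_data_from_webpage(soup, url)
    print(row['body'])

    # task 1: queue the newly found links so the while loop keeps crawling.
    urls += webpage_urls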
import time
import helpers

# add the first link to crawl:
urls = ['https://www.northwestern.edu/']
pagerank = {}

while len(urls) > 0:
    #########################
    # Don't forget to sleep #
    #########################
    time.sleep(2)

    # removes the top url from the list
    url = urls.pop(0)
    print('\nretrieving ' + url + '...')
    soup = helpers.get_webpage(url)
    if soup is None:
        print('Error retrieving {url}'.format(url=url))
    else:
        website_summary = helpers.extract_website_summary_from_webpage(soup)
        links_on_page = helpers.extract_links_from_webpage(soup, url)
        helpers.write_links_to_file(links_on_page)
        print(website_summary)

# Goal: modify this code to crawl through all the links of the northwestern
# website, and track how many times each website is linked to, using a dictionary.
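# ---------------------------------------------------------------------------
# A minimal sketch of how the goal above might be approached: every time a
# link appears on a page, bump its count in the pagerank dictionary, and only
# queue links that have not been seen before. The helpers functions are the
# ones used in the starter code above; the MAX_PAGES cap and the final sorted
# printout are assumptions added here so the sketch terminates quickly.
# ---------------------------------------------------------------------------
import time
import helpers

urls = ['https://www.northwestern.edu/']
pagerank = {}        # maps url -> number of times it is linked to
MAX_PAGES = 100      # assumption: cap the crawl for demonstration purposes
pages_crawled = 0

while len(urls) > 0 and pages_crawled < MAX_PAGES:
    time.sleep(2)    # be polite to the server between requests

    url = urls.pop(0)
    pages_crawled += 1
    print('\nretrieving ' + url + '...')
    soup = helpers.get_webpage(url)
    if soup is None:
        print('Error retrieving {url}'.format(url=url))
        continue

    links_on_page = helpers.extract_links_from_webpage(soup, url)
    for link in links_on_page:
        if link not in pagerank:
            # first time this url has been seen: queue it for crawling
            urls.append(link)
            pagerank[link] = 0
        pagerank[link] += 1

# show the most frequently linked-to pages once the crawl finishes
for link, count in sorted(pagerank.items(), key=lambda item: item[1], reverse=True)[:20]:
    print(count, link)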