# Example #1
# 0
def main():
    """Crawl Google Scholar 'complex systems' author pages and pickle the results.

    Crawls 5 pages starting from ``start_url`` with no positive/negative
    keyword filtering, then writes the crawl output to ``result.pickle``.
    """
    # Local import: `pickle` is used below but never imported at module
    # level anywhere in this file.
    import pickle

    ### The start page's URL
    start_url = 'https://scholar.google.com.tw/citations?view_op=search_authors&hl=en&mauthors=label:complex_systems'

    ### Positive / negative keyword lists — both empty, so nothing is filtered out.
    p_key = []
    n_key = []

    ### Google Scholar Crawler, Class Spider
    my_crawler = Spider(start_url, p_key, n_key, page=5)

    results = my_crawler.crawl()

    # Persist the crawl results for later offline querying.
    with open('result.pickle', 'wb') as f:
        pickle.dump(results, f, protocol=pickle.HIGHEST_PROTOCOL)
def main():
    """Crawl Google Scholar results for 'frequency lowering algorithm' and pickle them.

    Crawls 5 pages starting from ``start_url``, keeping hits that match the
    hearing-aid/audio positive keywords and rejecting imaging/optics hits via
    the negative keywords, then writes the results to ``result.pickle``.
    """
    # Local import: `pickle` is used below but never imported at module
    # level anywhere in this file.
    import pickle

    ### The start page's URL
    start_url = 'https://scholar.google.com.tw/scholar?q=frequency+lowering+algorithm&hl=zh-TW&as_sdt=0,5'

    ### Positive keywords: hearing-aid / audio-processing terms to keep.
    p_key = [
        'wdrc', 'dynamic range compression', 'hearing aid', 'speech',
        'noise cancellation', 'noise reduction', 'feedback cancellation',
        'sound', 'hearing loss'
    ]
    ### Negative keywords: imaging/optics terms to reject.
    n_key = [
        'imagery', 'image', 'visual', 'video', 'optic', 'opto', 'quantum',
        'photon'
    ]

    ### Google Scholar Crawler, Class Spider
    my_crawler = Spider(start_url, p_key, n_key, page=5)

    results = my_crawler.crawl()

    # Persist the crawl results for later offline querying.
    with open('result.pickle', 'wb') as f:
        pickle.dump(results, f, protocol=pickle.HIGHEST_PROTOCOL)
from Spider import Spider
from Query import Query
import sys

# Simple CLI dispatcher: `crawl` runs the Wikipedia spider, `query <term>`
# searches the previously built index. Length guards prevent the IndexError
# the original raised when a sub-command or query term was missing.
arguments = sys.argv
if len(arguments) < 2:
    print("usage: crawl | query <term>")
elif arguments[1] == "crawl":
    spider = Spider("https://en.wikipedia.org/")
    spider.crawl()
elif arguments[1] == "query":
    if len(arguments) < 3:
        print("usage: query <term>")
    else:
        query = Query(arguments[2])
        query.query()
# Example #4
# 0
if __name__ == "__main__":
    # Local imports: `os` and `time` are used below but never imported at
    # module level anywhere in this file.
    import os
    import time

    # List previously saved search pickles. The original bound this list to
    # the name `pickle`, shadowing the stdlib module — renamed to avoid that.
    saved_searches = os.listdir('pickle/')
    print('当前的已保存搜索文件:', saved_searches)  # "currently saved search files:"
    name = input('输入搜索代号:')  # "enter a search code name:"
    path = name + '.pickle'
    used_path = name + '_used.pickle'
    spider_main = Spider(name, used_path)
    if path not in saved_searches:
        # No saved file for this code name: crawl fresh data and time it.
        start = time.time()
        url = 'https://www.bilibili.com/index/rank/all-30-3.json'

        try:
            spider_main.crawl(url, path)
        except Exception as e:
            # Best-effort error log: record the failure but keep going so the
            # elapsed-time report below still prints. ('94' looks like a
            # source-line marker carried over from the original script.)
            with open('error/error.txt', 'a+') as f:
                f.write('94' + str(e) + '\n')

        elapsed = int(time.time() - start)
        if elapsed > 60:
            mins, second = divmod(elapsed, 60)
            # "time spent searching users: %d min %d s"
            print('搜索用户所用时间为%d分%d秒' % (mins, second))
        else:
            # "time spent searching users: %d s"
            print('搜索用户所用时间为%d秒' % elapsed)
    else:
        # Load the previously downloaded file instead of re-crawling.
        spider_main.load_users(path, used_path)