def post(self):
    # Pull the crawl parameters out of the incoming request.
    seed = self.request.get('seed')
    maxpages = int(self.request.get('maxpages'))
    maxdepth = int(self.request.get('maxdepth'))
    rest = int(self.request.get('rest'))

    # Crawl from the seed URL, then compute page ranks over the results.
    my_crawler = Crawler(seed, maxpages, maxdepth, rest)
    my_crawler.crawl_web()
    my_crawler.compute_ranks()
def main(): """ Contributors: - Scot Matson """ # TODO: This needs to be dynamically generated. The data gets destroyed # during runtime - GitHub does not store empty directories. # # Going parallel, we may want to use multiple directories to help # separate data sets, i.e., web1, web2, web3....... etc. data_directory = './web_pages/' # This is the biggest bottleneck in the application # Parallelizing this piece would give a major efficiency boost Crawler.crawl_web('http://www.sjsu.edu', 10) # Build a list of the files we have filepaths = list() for filename in os.listdir(data_directory): filepaths.append(data_directory + filename) map_data = list() for filepath in filepaths: fh = open(filepath, 'r') map_data.append(MapReduce.map(filepaths[0], fh)) fh.close() # Remove files after parsing os.remove(filepath) intermediate_data = dict() # Shuffle data set intermediate_data = MapReduce.unshuffle(map_data) # Reduce should be condensing the data reduced_data = dict() reduced_data = MapReduce.reduce(intermediate_data) # Output results in reverse sequential ordering by value csvfile = open('out/results.csv', 'w') writer = csv.writer(csvfile, delimiter=',') for tag in sorted(reduced_data, key=reduced_data.get, reverse=True): print(tag + ': '+ str(reduced_data[tag])) writer.writerow([tag, reduced_data[tag]]) csvfile.close()