import heapq
import itertools
from collections import namedtuple

# Assumed definition (not shown in the original): heapq compares namedtuples
# field by field, so entries order first by priority (depth), then by the
# tie-breaking id; the link sets themselves are never compared.
Links = namedtuple('Links', ['priority', 'id', 'links'])


def crawl_web(seed, max_depth=10, max_pages=1000):
    crawled = set()
    crawl_queue = []  # priority queue ensures that more "shallow" links are handled first
    index = {}
    graph = {}
    counter = itertools.count()

    def add_links(links, depth=0):
        """Add a set of links to crawl_queue, first removing any URL
        that is already in the set of crawled urls."""
        count = next(counter)
        new_links = links.difference(crawled)
        entry = Links(priority=depth, id=count, links=new_links)
        heapq.heappush(crawl_queue, entry)

    def index_page(page):
        """Add all of the words in page.content to the index, which maps
        lowercased words to sets of urls."""
        for word in page.content.split():
            word = word.lower()  # lowercase once, for both lookup and insert
            if word in index:
                index[word].add(page.url)
            else:
                index[word] = {page.url}

    add_links({seed}, 0)
    pages = 0
    while crawl_queue:
        entry = heapq.heappop(crawl_queue)
        to_crawl = entry.links
        depth = entry.priority
        while to_crawl and pages < max_pages:
            url = to_crawl.pop()
            page = Page(url)  # Page is an external dependency; see the sketch below
            if page.is_valid() and url not in crawled:
                print(url, depth)
                pages += 1
                crawled.add(url)
                index_page(page)
                graph[url] = page.outgoing_links
                if depth < max_depth:
                    add_links(page.outgoing_links, depth + 1)
    return index, graph
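

# --- Usage sketch (assumptions, not part of the original) ---
# crawl_web relies on a Page class exposing url, content, outgoing_links,
# and is_valid(), which is never defined above. The stub below is a minimal
# stand-in for that assumed interface: it fakes fetching with a hard-coded
# two-page "web" (FAKE_WEB, a hypothetical name) so the crawler can run
# end to end without network access.
FAKE_WEB = {
    'http://a.example': ('alpha beta', {'http://b.example'}),
    'http://b.example': ('beta gamma', set()),
}


class Page:
    def __init__(self, url):
        self.url = url
        content, links = FAKE_WEB.get(url, ('', set()))
        self.content = content
        self.outgoing_links = links

    def is_valid(self):
        return self.url in FAKE_WEB


if __name__ == '__main__':
    index, graph = crawl_web('http://a.example')
    print(index)  # word -> set of urls; e.g. 'beta' maps to both urls
    print(graph)  # adjacency map: url -> set of outgoing links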