Python CrawlerUtils Exemples

Langage de programmation: Python

Espace de nommage/Pack: ccrawler.modules.crawler_utils

Class/Type: CrawlerUtils

Exemples au hotexamples.com: 2

Python CrawlerUtils - 2 exemples trouvés. Ce sont les exemples réels les mieux notés de ccrawler.modules.crawler_utils.CrawlerUtils extraits de projets open source. Vous pouvez noter les exemples pour nous aider à en améliorer la qualité.

Méthodes fréquemment utilisées

Afficher Cacher

build_crawler_request_msg(2)

Méthodes fréquemment utilisées

build_crawler_request_msg (2)

Associées

Expressions

DirichletBCSet

push

YoutubeForm

SimpleSEGYWriter

cleanup_std_log

point_in_rect_q

W_NDimArray

SettingsForm

xbleu

Related in langs

Campaign (PHP)

CodeGenerator (PHP)

TaxiCompany (C#)

PoolingDescriptor (C#)

jpeg_load_image_from_buffer (C++)

expr_ctx (C++)

Log (Go)

V8_ObjectTemplate_NewInstance (Go)

SpawnedVMSupport (Java)

LWWindowPeer (Java)

Exemple #1

0

Afficher le fichier

Fichier : sort_scheduler.py Projet : qwang2505/ccrawler

def _process(self): while True: now = datetime.datetime.utcnow() url_info = crawlerdb.find_and_modify_expired_url_info(now, common_settings.crawler_msg_meta_fields) if url_info is None: break url = url_info["url"] message_type, crawler_request_msg = CrawlerUtils.build_crawler_request_msg(url, url_info) handler.HandlerRepository.process(message_type, crawler_request_msg) logging.debug(self._log_formatter("sent to crawler", url=url))

Exemple #2

0

Afficher le fichier

Fichier : crawl_handler.py Projet : qwang2505/ccrawler

def _process(self, message): # normalize url url = url_analyser.normalize_url(message["url"]) if url is None: logging.error("invalid url for crawl", url = message["url"]) return {"status" : -1} message["url"] = url #fill optional fields url_info = misc.clone_dict(message, fields = ["url", "source", "root_url", "parent_url", "crawl_priority", "crawl_depth"]) self._assign_url_info_defaults(url_info) if url_info["root_url"] is None: url_info["root_url"] = url #deterimine crawl priority/depth is_valid, url_info["crawl_priority"], url_info["crawl_depth"] = crawl_priority_and_depth_evaluator.evaluate(url, url_info["source"], url_info) if not is_valid: return {"status" : -1} # stores to urlRepository table url_info["page_last_modified"] = None url_info["crawl_status"] = "crawling" url_info["last_crawled"] = None url_info["original_url"] = None # all urls is static now url_info["crawl_type"] = "static" # TODO add to crawler db, this should not be done here # some project do not need to store url info into database # should use middleware for these kind of actions #success, promoted = crawlerdb.add_url_info(url, url_info, True) if message["source"] != "redirected": # notify crawler message_type, crawler_message = CrawlerUtils.build_crawler_request_msg(url, url_info) handler.HandlerRepository.process(message_type, crawler_message) return {"status" : 1}