def _normalize_links(self, url, link_infos):
    """Normalize extracted links against the page url.

    Each entry of *link_infos* is a (link, info) pair; only the link part
    is used. Links that fail normalization (None) or that point back to
    the page itself are dropped.
    """
    candidates = (url_analyser.normalize_url(raw_link, url) for raw_link, _ in link_infos)
    return [normalized for normalized in candidates if normalized is not None and normalized != url]
def _process(self, message):
    """Validate, enrich, and dispatch a crawl-request message.

    Returns {"status": 1} when the request was accepted (and, unless the
    source is "redirected", forwarded to the crawler), or {"status": -1}
    when the url is invalid or rejected by the priority/depth evaluator.
    """
    # Canonicalize the incoming url; reject the message if that fails.
    url = url_analyser.normalize_url(message["url"])
    if url is None:
        # NOTE(review): keyword args like url= are not supported by stdlib
        # logging.error — presumably a structlog-style logger; confirm.
        logging.error("invalid url for crawl", url = message["url"])
        return {"status" : -1}
    message["url"] = url

    # Copy the fields of interest and fill in any missing optional ones.
    url_info = misc.clone_dict(message, fields = ["url", "source", "root_url", "parent_url", "crawl_priority", "crawl_depth"])
    self._assign_url_info_defaults(url_info)
    if url_info["root_url"] is None:
        url_info["root_url"] = url

    # Determine crawl priority/depth; the evaluator can reject the url outright.
    is_valid, url_info["crawl_priority"], url_info["crawl_depth"] = crawl_priority_and_depth_evaluator.evaluate(url, url_info["source"], url_info)
    if not is_valid:
        return {"status" : -1}

    # Fields destined for the urlRepository table.
    url_info.update(
        page_last_modified = None,
        crawl_status = "crawling",
        last_crawled = None,
        original_url = None,
        # all urls is static now
        crawl_type = "static",
    )

    # TODO add to crawler db, this should not be done here
    # some project do not need to store url info into database
    # should use middleware for these kind of actions
    #success, promoted = crawlerdb.add_url_info(url, url_info, True)

    # Redirected urls were already dispatched once; do not notify the crawler again.
    if message["source"] != "redirected":
        message_type, crawler_message = CrawlerUtils.build_crawler_request_msg(url, url_info)
        handler.HandlerRepository.process(message_type, crawler_message)

    return {"status" : 1}