Example #1
from queue import Queue
from threading import Lock
from urllib.parse import urlsplit

# Scraper is the project's own helper class; its actual import path is not
# shown in this example.


class Spider:
    def __init__(self, url, domain, limit, limit_param, result_file_name,
                 max_threads, sema, verbose):
        """
        Create instance of Spider.

        :param url: website url
        :param domain: domain of website
        :param limit: crawling limit type ("depth" or "count")
        :param limit_param: limit parameter (max depth or max number of pages)
        :param result_file_name: file to store results in
        :param max_threads: maximum number of threads per process
        :param sema: semaphore (used for release action)
        :param verbose: verbosity of Spider
        """
        # Note: the locks below are needed because worker threads update
        # shared state in parallel.
        self._emails_file_path = result_file_name

        self._max_threads = max_threads

        self._sema = sema  # a semaphore

        # the starting URL must be encoded, just like every crawled link
        self._url = Scraper.create_http_link(urlsplit(url))
        self._domain = domain

        # set limit properties
        self.limit = limit
        self.limit_param = limit_param

        # number of pages scanned so far
        self._count = 0
        self._count_lock = Lock()  # protects self._count

        # create links-to-visit queue
        self._to_visit = Queue()
        self._to_visit.put(self._url)
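        # Worker threads (up to self._max_threads) are expected to pull links
        # from this queue and push newly discovered ones back; queue.Queue is
        # itself thread-safe, so it needs no extra lock.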

        self._scraper = Scraper(self._domain, url)  # the spider's link scraper
        # emails already found; a plain list suffices because emails are short
        # strings, so linear membership checks stay cheap (no hash set needed)
        self._emails = []
        self._email_lock = Lock()  # protects the emails list and file writes

        self.verbose = verbose
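
A minimal construction sketch (not part of the original example): the argument
values are illustrative, and the Semaphore(0) pattern is an assumption that the
caller blocks on acquire() until the spider releases the semaphore when done.

from threading import Semaphore

sema = Semaphore(0)  # the spider is expected to release this when it finishes

spider = Spider(
    url="https://example.com",
    domain="example.com",
    limit="depth",              # or "count"
    limit_param=3,              # max depth (or max pages when limit == "count")
    result_file_name="emails.txt",
    max_threads=8,
    sema=sema,
    verbose=True,
)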