def parse_urls(self, html):
    """Produce a list of unvisited URLs found in the given html.

    :type html: str
    :rtype: list
    """
    soup = BeautifulSoup(html, "html.parser")
    urls = []

    # Section headings are (presumably) only on the main page; only
    # headings that actually wrap an anchor contribute a URL.
    for element in soup.find_all("h2", {"class": "section-heading"}):
        if element.a:
            self._collect_if_new(element.a.get("href"), urls)

    # Story links appear on the main page and as relevant articles.
    for element in soup.find_all("a", {"class": "story-link"}):
        self._collect_if_new(element.get("href"), urls)

    return urls

def _collect_if_new(self, url, urls):
    """Append the cleaned form of *url* to *urls* if not yet visited.

    NOTE(review): the visited check uses the raw href while the stored
    value is cleaned — confirm self.visited_urls holds raw hrefs.
    Also, an anchor without an href yields None here; presumably
    Utility.clean_url tolerates that — verify.
    """
    if url not in self.visited_urls:
        urls.append(Utility.clean_url(url))
def parse_urls(self, html):
    """Append new URLs present in the given html to the URL queue.

    :type html: str
    """
    soup = BeautifulSoup(html, "html.parser")

    # Section headings are (presumably) only on the main page; only
    # headings that actually wrap an anchor contribute a URL.
    for element in soup.find_all("h2", {"class": "section-heading"}):
        if element.a:
            self._enqueue_if_new(element.a.get("href"))

    # Story links appear on the main page and as relevant articles.
    for element in soup.find_all("a", {"class": "story-link"}):
        self._enqueue_if_new(element.get("href"))

def _enqueue_if_new(self, url):
    """Enqueue the cleaned form of *url* if it was not yet visited.

    NOTE(review): the visited check uses the raw href while the queued
    value is cleaned — confirm self.visited_urls holds raw hrefs.
    Also, an anchor without an href yields None here; presumably
    Utility.clean_url tolerates that — verify.
    """
    if url not in self.visited_urls:
        self.url_queue.append(Utility.clean_url(url))