A multi-threaded, open source web crawler


  • Use multiple threads to visit web pages
  • Extract web page data using XPath expressions or CSS selectors
  • Extract urls from a web page and visit extracted urls
  • Write extracted data to an output file
  • Set HTTP session parameters such as: cookies, SSL certificates, proxies
  • Set HTTP request parameters such as: header, body, authentication
  • Download files from the urls
  • Supports Python 2 and Python 3


pip install xcrawler

Installing the lxml library on Windows may fail with a "Microsoft Visual C++ is required" error.
To install lxml on Windows:
  1. Download and install Microsoft Windows SDK:
  2. Click the Start Menu, search for and open the command prompt:
    • For Python 2.6, 2.7, 3.0, 3.1, 3.2: CMD Shell
    • For Python 3.3, 3.4: Windows SDK 7.1 Command Prompt
  3. Install lxml
setenv /x86 /release && SET DISTUTILS_USE_SDK=1 && set STATICBUILD=true && pip install lxml


A page scraper extracts data and urls from a web page.
To do so, override the following methods of PageScraper:
* extract returns the data extracted from a web page
* visit returns the next Pages to be visited
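
The division of labour between the two methods can be sketched without any network access. The following is a minimal, single-threaded illustration that uses stand-in classes, not xcrawler itself: `SITE`, `TitleScraper`, `crawl`, and the stand-in `Page` are hypothetical names, and skipping already seen urls is an assumption of the sketch rather than documented xcrawler behaviour.

```python
from collections import deque

# Hypothetical in-memory "site": url -> (title, outgoing links)
SITE = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("Page A", ["/b"]),
    "/b": ("Page B", []),
}

class Page:
    """Stand-in for a fetched page; pairs a url with the scraper that handles it."""
    def __init__(self, url, scraper):
        self.url = url
        self.scraper = scraper
        self.title, self.links = SITE[url]

class TitleScraper:
    def extract(self, page):
        # Data that would be written to the output file
        return page.title

    def visit(self, page):
        # Next pages to crawl, each with its own scraper
        return [Page(link, TitleScraper()) for link in page.links]

def crawl(start_pages):
    queue = deque(start_pages)
    seen = {page.url for page in start_pages}
    results = []
    while queue:
        page = queue.popleft()
        results.append(page.scraper.extract(page))
        for next_page in page.scraper.visit(page):
            # Assumption of this sketch: each url is visited only once
            if next_page.url not in seen:
                seen.add(next_page.url)
                queue.append(next_page)
    return results

titles = crawl([Page("/", TitleScraper())])
```

Here `extract` only produces output data and `visit` only produces more work, so a scraper that needs just one of the two can leave the other out, as the examples further down do.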

The crawler can be configured before it starts crawling web pages. Configurable settings include:
* the number of threads used to visit web pages
* the name of an output file
* the request timeout
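
To make the role of these three settings concrete, here is a rough standard-library sketch of what each one controls. It does not use xcrawler: `fetch` is a hypothetical placeholder that performs no HTTP request, and an in-memory buffer stands in for the output file.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

NUMBER_OF_THREADS = 3  # worker threads used to visit pages
REQUEST_TIMEOUT = 5    # seconds; only passed through to the placeholder fetch()

def fetch(url, timeout):
    # Placeholder for an HTTP request: returns a fake row instead of hitting the network
    return [url, "fetched with timeout=%s" % timeout]

urls = ["/a", "/b", "/c", "/d"]

with ThreadPoolExecutor(max_workers=NUMBER_OF_THREADS) as pool:
    # map() returns results in input order even though the calls run concurrently
    rows = list(pool.map(lambda url: fetch(url, REQUEST_TIMEOUT), urls))

# An in-memory buffer stands in for the CSV output file
output = io.StringIO()
csv.writer(output).writerows(rows)
```

In xcrawler itself the corresponding values are set on `crawler.config`, as the examples below show.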
To run the crawler call:

Examples of how to use xcrawler can be found at:

XPath Example

from xcrawler import XCrawler, Page, PageScraper

class Scraper(PageScraper):
    def extract(self, page):
        topics = page.xpath("//a[@class='question-hyperlink']/text()")
        return topics

start_pages = [ Page("", Scraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "stackoverflow_example_crawler_output.csv"
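
To see what an XPath expression like the one above selects without fetching a page, it can be tried on an inline snippet. This sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (there is no `text()` step), so it selects the elements and reads `.text` instead. The markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page
html = """
<html><body>
  <a class="question-hyperlink" href="/q/1">First question</a>
  <a class="other" href="/about">About</a>
  <a class="question-hyperlink" href="/q/2">Second question</a>
</body></html>
"""

root = ET.fromstring(html)
# Select every <a> whose class attribute is exactly 'question-hyperlink'
links = root.findall(".//a[@class='question-hyperlink']")
topics = [a.text for a in links]
```

Only the two matching links contribute to `topics`; the link with class `other` is ignored.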

CSS Example

from xcrawler import XCrawler, Page, PageScraper

class StackOverflowItem:
    def __init__(self):
        self.title = None
        self.votes = None
        self.tags = None
        self.url = None

class UrlsScraper(PageScraper):
    def visit(self, page):
        hrefs = page.css_attr(".question-summary h3 a", "href")
        urls = page.to_urls(hrefs)
        return [Page(url, QuestionScraper()) for url in urls]

class QuestionScraper(PageScraper):
    def extract(self, page):
        item = StackOverflowItem()
        item.title = page.css_text("h1 a")[0]
        item.votes = page.css_text(".question .vote-count-post")[0].strip()
        item.tags = page.css_text(".question .post-tag")[0]
        item.url = page.url
        return item

start_pages = [ Page("", UrlsScraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "stackoverflow_css_crawler_output.csv"
crawler.config.number_of_threads = 3

File Example

from xcrawler import XCrawler, Page, PageScraper

class WikimediaItem:
    def __init__(self):
        self.name = None  # file name (attribute name assumed)
        self.base64 = None

class EncodedScraper(PageScraper):
    def extract(self, page):
        url = page.xpath("//div[@class='fullImageLink']/a/@href")[0]
        item = WikimediaItem()
        item.name = url.split("/")[-1]  # last path segment of the url
        item.base64 = page.file(url)
        return item

start_pages = [ Page("", EncodedScraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "wikimedia_file_example_output.csv"

Session Example

from xcrawler import XCrawler, Page, PageScraper
from requests.auth import HTTPBasicAuth

class Scraper(PageScraper):
    def extract(self, page):
        return str(page)

start_pages = [ Page("", Scraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "router_session_example_output.csv"
crawler.config.session.headers = {"User-Agent": "Custom User Agent",
                                  "Accept-Language": "fr"}
crawler.config.session.auth = HTTPBasicAuth('admin', 'admin')

Request Example

from xcrawler import XCrawler, Page, PageScraper

class Scraper(PageScraper):
    def extract(self, page):
        return str(page)

start_page = Page("", Scraper())
start_page.request.cookies = {"theme": "classic"}
crawler = XCrawler([start_page])
crawler.config.request_timeout = (5, 5)
crawler.config.output_file_name = "router_request_example_output.csv"


For more information about xcrawler, see the source code and its Python docstrings.
The documentation can also be accessed at runtime with Python's built-in help function:
>>> import xcrawler
>>> help(xcrawler.Config)
    # Information about the Config class
>>> help(xcrawler.PageScraper.extract)
    # Information about the extract method of the PageScraper class


GNU GPL v2.0

