DocSearch scraper

This is the repository for the DocSearch scraper. You can run it on your own, or ask us to crawl your documentation.

DocSearch is in fact made up of three different projects.

This project is a collection of submodules, each one in its own directory:

  • cli: A command line tool to manage DocSearch. Run ./docsearch and follow the steps
  • deployer: Tool used by Algolia to deploy the configuration in our Mesos infrastructure
  • doctor: A monitoring/repair tool to check if the indices built by the scraper are in good shape
  • playground: An HTML page to easily test DocSearch indices
  • scraper: The core of the scraper. It reads the configuration file, fetches the web pages and indexes them in Algolia.

Getting started

Install DocSearch

  • Install Python
    • brew install python # will install pip
    • apt-get install python
    • Or any other way
  • git clone git@github.com:algolia/documentation-scraper.git
  • cd documentation-scraper
  • pip install -r requirements.txt
  • Download geckodriver from https://github.com/mozilla/geckodriver/releases and extract it
  • Rename the geckodriver executable to wires and make it accessible in your PATH
  • Depending on what you want to do, you might also need to install Docker, especially to run tests (see the condensed sketch after this list)
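
Condensed, the steps above might look like this on macOS with Homebrew (a sketch; the geckodriver archive name and install location are assumptions, adjust them for your platform):

$ brew install python                # also installs pip
$ git clone git@github.com:algolia/documentation-scraper.git
$ cd documentation-scraper
$ pip install -r requirements.txt
$ tar -xzf geckodriver-*.tar.gz      # archive name depends on the release you picked
$ mv geckodriver wires               # the scraper expects the binary to be named wires
$ sudo mv wires /usr/local/bin/      # any directory on your PATH works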

Set up DocSearch

Create a file named .env at the root of the project:

APPLICATION_ID=
API_KEY=

To get the APPLICATION_ID and API_KEY, you need to create an [Algolia account](https://www.algolia.com/users/sign_up).

You should be able to do everything with the docsearch CLI tool:

$ ./docsearch
Docsearch CLI

Usage:
  ./docsearch command [options] [arguments]

Options:
  --help    Display help message

Available commands:
  test                  Run tests
  playground            Launch the playground
  run                   Run a config
 config
  config:bootstrap      Bootstrap a docsearch config
  config:docker-run     Run a config using docker
 docker
  docker:build-scraper  Build scraper images (dev, prod, test)

Use DocSearch

Create a config

To use DocSearch, the first thing you need is a config for the crawler. For more details about configs, check out https://github.com/algolia/docsearch-configs: you'll find the full list of available options and a lot of live, working examples.
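
For illustration, a minimal config might look like the sketch below (the index name, start URL, and selectors are placeholder values; docsearch-configs documents the full option set):

{
  "index_name": "example",
  "start_urls": [
    "https://docs.example.com/"
  ],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p"
  }
}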

Crawl the website

Without docker:

$ ./docsearch run /path/to/your/config

With docker:

$ ./docsearch docker:build-scraper # Build the scraper Docker images
$ ./docsearch config:docker-run /path/to/your/config # Run the config in a Docker container

Check that everything went well

Open ./playground/index.html in your browser, enter your credentials and your index name, and type a few queries to make sure everything works as expected.
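
You can also query the index directly through Algolia's REST search API, for instance with curl (a sketch; <INDEX_NAME> and the query string are placeholders, and APPLICATION_ID and API_KEY are the values from your .env file):

$ curl -X POST \
    -H "X-Algolia-Application-Id: ${APPLICATION_ID}" \
    -H "X-Algolia-API-Key: ${API_KEY}" \
    "https://${APPLICATION_ID}-dsn.algolia.net/1/indexes/<INDEX_NAME>/query" \
    --data '{"params": "query=getting started"}'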

Use the DocSearch frontend

Just add this snippet to your documentation:

<link rel="stylesheet" href="//cdn.jsdelivr.net/docsearch.js/2/docsearch.min.css" />
<script type="text/javascript" src="//cdn.jsdelivr.net/docsearch.js/2/docsearch.min.js"></script>
<script type="text/javascript">
var search = docsearch({
  apiKey: '<API_KEY>',
  indexName: '<INDEX_NAME>',
  inputSelector: '<YOUR_INPUT_DOM_SELECTOR>',
  debug: false
});
</script>

And you are good to go!

Admin tasks

If you are an Algolia employee and want to manage a DocSearch account, you'll need to add the following variables to your .env file:

WEBSITE_USERNAME=
WEBSITE_PASSWORD=
SLACK_HOOK=
SCHEDULER_USERNAME=
SCHEDULER_PASSWORD=
DEPLOY_KEY=

The CLI will then expose more commands for you to run.

For some actions, like deploying, you might need to use different credentials than the ones in the .env file. To do so, override them when running the CLI tool:

APPLICATION_ID= API_KEY= ./docsearch deploy:configs
