Please note: all tools/ scripts in this repo are released for use "AS IS" without any warranties of any kind, including, but not limited to their installation, use, or performance. We disclaim any and all warranties, either express or implied, including but not limited to any warranty of noninfringement, merchantability, and/ or fitness for a particular purpose. We do not warrant that the technology will meet your requirements, that the operation thereof will be uninterrupted or error-free, or that any errors will be corrected. Any use of these scripts and tools is at your own risk. There is no guarantee that they have been through thorough testing in a comparable environment and we are not responsible for any damage or data loss incurred with their use. You are responsible for reviewing and testing any scripts you run thoroughly before use in any non-testing environment.
A domain-specific search engine powered by MongoDB.
Deploying the pre-packaged site requires only a few steps.
- Start a MongoDB instance configured to your liking on port 27017.
- Run
python scrape.py
to collect data. (Warning, it takes quite a while to collect all the data and requires about 2GB of disk space.) - Run
python util/indexes.py
to build the index. (This may also take upwards of 10 minutes, depending on how much data you collected.) - Run
python search.py
. The search engine is now deployed tolocalhost:5000
.
For instructions on the query language, see the cheat sheet on the home page of your deployment.
It is recommended to schedule scrapes in some way to keep your data up to date. The recommended method of this is with cron
; hence an example crontab
is included.
The file tasks.py
is included to allow the scrapers to be run with Celery. To start the celery workers, run:
$ celery -A tasks worker
$ python tasks.py
Executing tasks.py will queue up an initial scrape for each data source. When a source is finished, its default behavior is to immediately start again. For this reason, it's advisable to run celery in the background by using celeryd (or your favorite solution).
Scrapes that run after the initial ones are incrememtal, only looking for data that's updated since the previous scrape (although this behavior is not currently available in Confluence). Consider this when deciding how often to scrape.
Several common data sources, such as Stack Overflow, Github, and JIRA are available out of the box. However, defining other data sources is possible. There are a handful of steps required.
- Build a new scraper for your source in the
scrapers
directory. It should inherit frombase_scraper.BaseScraper
and implement the_scrape
(and possibly_setup
) methods. If your scraper requires setup, make sure to set theself.needs_setup
flag to True. The purpose of the scraper is to describe the logic behind any RESTful API you may be using. Once you are done and want to use your scraper, import it inside thescrapers/__init__.py
file. - Build a transformer for your new data source. This simply packages the raw data your scraper saved into a small, more web-friendly format. As with the scraper, define it in the
transformers
directory and import it intransformers/__init__.py
. - Add a results template to
templates/results
. If you want your template to have special styling, you should add style information tostatic/style.css
. - Add a configuration field to
config/search.py
. There is a lot of information to specify here. The following is an example configuration object:
# in config/search.py
CONFIG = {
...
'my_new_source': {
'fullname': 'My New Source', # the display name
'scraper': scrapers.MyNewScraper, # how to scrape for this source
'transformer': transformers.MyNewTransformer, # how to transform
'subsources': { # if you have subsources, define their description here. otherwise, this field is None
'name': 'section',
'field': 'some_json_field'
},
'projector': {
# all the fields you want saved by the scraper. just include the field name with a value of 1.
},
'view': 'results/my_new_source_result.html', # the display template
'advanced': [ # the list of advanced search options
{
'name': 'An Advanced Field',
'field': 'some_field',
'type': 'text'
}
],
'auth': { # auth information. if your source does not need any, leave it out.
'user': 'username',
'password': 'password'
},
# if you want any other information included, add it here
},
...
}
Now you can run your scraper by running python scrape.py my_new_source
. Once it's all done, you're ready to search!