Simple but powerful open source web application and crawler for searching cocktail recipes from the web. Check out the demo or read below to run it on your own machine.
git clone https://github.com/wallunit/cocktail-search
cd cocktail-search
git submodule init
git submodule update
If you are on Debian Wheezy, you can install everything except less with apt-get:
apt-get install python-werkzeug python-scrapy python-stemmer python-lxml sphinxsearch
In order to install less, install node.js and run:
npm install -g less
Crawling websites will not only consume a lot of your bandwidth, but also generate a lot of traffic on the websites you are crawling. So please be nice and don't run the crawler unless absolutely necessary, for example when you have to test a spider that you have just added or modified. For every other case, I have made the files with the cocktail recipes I have already crawled available for you:
wget -r -A .json http://cocktails.p24dev.de/data/
mv cocktails.p24dev.de/data/* crawler/
rm -r cocktails.p24dev.de
However, the following commands will run the crawler for a given spider:
cd crawler
rm -f <spider>.json
scrapy crawl <spider> -o <spider>.json
Note that when the output file already exists, Scrapy will append the scraped recipes to the bottom of the existing file, so make sure to delete it first.
There is no RDBMS. All data are stored in a Sphinx index that is built from the crawled cocktail recipes. In order to build the index and run the search daemon in the console, just run:
cd sphinx
indexer --all
searchd --console
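Once searchd is running, you can sanity-check the index from Python with the sphinxapi client that ships with Sphinx. This is only a quick sketch: the index name cocktails and the default API port 9312 are assumptions, so check sphinx/sphinx.conf for the actual values.

import sphinxapi

# Connect to the locally running searchd (9312 is Sphinx's default API port).
client = sphinxapi.SphinxClient()
client.SetServer('localhost', 9312)

# Search the index; "cocktails" is an assumed index name, not necessarily
# the one defined in sphinx/sphinx.conf.
result = client.Query('mojito', 'cocktails')
if result:
    for match in result['matches']:
        # Each match carries the document id, its weight, and attributes.
        print(match)
else:
    print(client.GetLastError())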
In order to serve the website from your local machine and start hacking, there is no need to set up an advanced web server like Apache. Just run the development server and go to http://localhost:8000/ with your web browser:
./web/app runserver
By default the development server only listens on localhost. However, if you want to access the website from another device, you can also make it listen on all interfaces:
./web/app runserver 0.0.0.0:8000
Create the file web/settings.py and set the following options:
SITE_URL = 'http://cocktails.p24dev.de/'
LESSC_OPTIONS = ['--compress']
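LESSC_OPTIONS is presumably just a list of extra command line arguments handed to the lessc compiler ('--compress' is a real lessc flag). A hypothetical sketch of how such a setting might be passed through; the compile_less helper is made up for illustration, and the real call site lives somewhere in web/app:

import subprocess

LESSC_OPTIONS = ['--compress']

def compile_less(source, target):
    # Hypothetical helper: roughly "lessc --compress <source> <target>".
    subprocess.check_call(['lessc'] + LESSC_OPTIONS + [source, target])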
For the production site itself, an Apache virtual host with mod_wsgi could look like this:

<VirtualHost *:80>
	ServerName cocktails.p24dev.de

	Alias /static /var/www/cocktails/web/static

	WSGIDaemonProcess cocktails processes=4 maximum-requests=500 threads=1
	WSGIProcessGroup cocktails
	WSGIScriptAlias / /var/www/cocktails/web/app.wsgi

	RewriteEngine On
	RewriteRule ^/$ /static/index.html [P]
</VirtualHost>
Some static files (like the CSS, which is compiled from Less) are generated on the fly in the development environment, but must be compiled when deploying to production, in order to serve them faster:
./web/app deploy
Remember to run that command every time you deploy a new version.
Build the index and start the search daemon:
cd sphinx
indexer --all
searchd
Note that we omitted the --console option in order to make searchd run in the background. However, instead of just calling searchd on the command line, it would be even better to set up an init script to start and stop Sphinx.
There is rarely a need to restart the search daemon. When you have deployed a new version of the cocktail search, or when you have run the crawler again, just rebuild and rotate the index:
cd sphinx
indexer --all --rotate
This project is my playground for new web technologies and frameworks, and you are invited to make it your playground as well. The code base is still small and well organized, and setting up the development environment is easy and straightforward.
The easiest way to get involved would probably be to write spiders for more cocktail websites. Most spiders consist of only a few lines of Python code (see the sketch below), and you don't have to know anything about the rest of the stack. Or you could contribute to the wordforms and synonyms lists without any programming skills at all. Also have a look at the open issues and feel free to fix some of them. I prefer to get pull requests via GitHub, but will also accept patches via email.
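To give you an idea of the effort involved, here is a rough sketch of what such a spider could look like. It is written against a recent Scrapy API (the version packaged for Wheezy is older), and the site URL, the XPath expressions, and the item fields are all invented for illustration; they do not reflect the project's actual item definition:

import scrapy

class ExampleSpider(scrapy.Spider):
    # The name is what you pass to "scrapy crawl <spider>".
    name = 'example'
    start_urls = ['http://example.com/cocktails/']

    def parse(self, response):
        # Yield one item per recipe block found on the page.
        for recipe in response.xpath('//div[@class="recipe"]'):
            yield {
                'title': recipe.xpath('.//h2/text()').get(),
                'ingredients': recipe.xpath('.//li/text()').getall(),
            }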
You have found a bug and don't want to fix it yourself? Or you have an awesome idea to improve the cocktail search? That's great. Please send me an email or, even better, use the issue tracker.