LegCo Watch

This project is in active development.

Overview

LegCo Watch is a parliamentary monitoring website in the same vein as openparliament.ca, GovTrack.us, and TheyWorkForYou.

One of the first challenges is parsing the data stored on LegCo's official website. Some of it is available as XML or through an API, but most of it is in HTML or PDF. Most of the current code focuses on parsing, with some basic display of the parsed data.

Technology stack

We use Django for our backend and various Python packages to conduct our scraping. The Pombola and Poplus projects have been very helpful as a guide for how to implement some of our components in Django.

Development Environment

Docker and Docker Compose are supported with all of the required services. Run docker-compose up to bootstrap a development environment. Here's a brief description of what each container does:

- Data-only containers
  - appdata
  - dbdata
  - logdata
- dbserver - Postgres server
- rabbitmq - RabbitMQ server
- scrapyd - Scrapyd server
- appserver - Django application server
- worker - Celery worker
- scrapydserver - Scrapyd; probably no longer needed

Docker Compose is a tool for configuring Docker containers and launching them quickly. docker-compose.yml is the Docker Compose configuration file (Compose was formerly called Fig); it defines how the containers are set up so that they can talk to each other.

To execute Django management commands, use docker-compose run appserver python manage.py followed by the command name.

If you don't want to use Docker, you are currently on your own. Besides installing all the requirements, there are two files that you will definitely need to change:

- app/legco/legcowatch/local.py to point to the correct Postgres database.
- app/legco/legcowatch/celery.py to point to the correct RabbitMQ instance.
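
As a rough guide to those two changes, here is a minimal sketch of the kind of values involved. Every name, credential, host, and environment variable below is a placeholder chosen for illustration, not taken from the project, and how the broker URL is passed to Celery depends on the Celery version in use.

```python
import os

# All names, credentials, and hosts below are placeholders,
# not the project's actual values.
DATABASES = {
    "default": {
        # Older Django releases call this engine
        # "django.db.backends.postgresql_psycopg2".
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", "legcowatch"),
        "USER": os.environ.get("DB_USER", "postgres"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", "localhost"),
        "PORT": os.environ.get("DB_PORT", "5432"),
    }
}

# AMQP URL for a local RabbitMQ server using its default guest account.
BROKER_URL = "amqp://guest:guest@localhost:5672//"
```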

Vagrant and Ansible are no longer used.

Folder structure

Scraping

All of the Scrapy scrapers are stored in app/raw/scraper. The scrapers were intended to kick off their jobs with Celery tasks. The status of these jobs is stored in the Django database. You can find the logic in app/raw/tasks.py.
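
To give a feel for what kicking off a job can look like, here is a minimal sketch that assumes jobs are scheduled through Scrapyd's HTTP API (its schedule.json endpoint). The base URL, the project name, and the spider name are assumptions for illustration; the actual logic in app/raw/tasks.py may work differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: Scrapyd is reachable on its default port on this host.
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(project, spider, base_url=SCRAPYD_URL):
    """Build (but do not send) a POST to Scrapyd's schedule.json endpoint."""
    data = urlencode({"project": project, "spider": spider}).encode("ascii")
    # Supplying `data` makes urllib issue a POST request.
    return Request(base_url + "/schedule.json", data=data)


def schedule(project, spider):
    """Send the request; Scrapyd replies with JSON that includes a job id."""
    with urlopen(build_schedule_request(project, spider)) as response:
        return json.loads(response.read().decode("utf-8"))
```

A Celery task could wrap schedule() and record the returned job id so that the job's status can be looked up later.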

You can also run the scrapers without Celery, in which case they're just normal Scrapy scrapers. The scrapers' JSON outputs should be saved, and some scrapers will download additional files (e.g. the Hansard scraper).

Parsing

Once the data has been downloaded, it is processed by the classes in app/processors. There is also a Celery task that kicks off processing: raw.tasks.process_scrape. You can run the processing manually, but be careful with the paths in the JSON results and the downloaded files; you may need to fiddle with the processors so that they find the right files.

The processors take the Raw objects and load them into cleaned-up parsed models.
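
The raw-to-parsed step can be pictured with a small sketch. The RawItem and ParsedItem classes, their fields, and the date format are all hypothetical stand-ins; the real Raw and parsed models are Django models with their own fields.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class RawItem:
    """Hypothetical stand-in for a Raw object: values kept as scraped."""
    uid: str
    title: str
    date: str  # assumed to look like "2014-10-08", possibly padded


@dataclass
class ParsedItem:
    """Hypothetical stand-in for a cleaned-up parsed model."""
    uid: str
    title: str
    date: date


def process(raw):
    """Normalize whitespace and convert the date string to a date object."""
    title = " ".join(raw.title.split())
    parsed_date = datetime.strptime(raw.date.strip(), "%Y-%m-%d").date()
    return ParsedItem(uid=raw.uid, title=title, date=parsed_date)
```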

In addition to the processors, there is a parser for Agenda documents that creates an Agenda class that can be used to extract data out of the Docs. This is in app/raw/docs/agenda.py. It's far from perfect, but it'll get you most of the data.

There is a bit of useful code in app/raw/names.py that helps disambiguate member names. In different parts of the LegCo documents, members can be referred to by their Chinese name, their English name, with or without their title, or any number of variants. This code tries to build some utility classes that let you match two people even if their names appear a bit differently. It's pretty naive, but it covers a lot of the cases in the LegCo docs.
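
The general idea can be illustrated with a small, self-contained sketch. This is not the code in app/raw/names.py: the title and honours lists, the function names, and the matching rule are assumptions, and the sketch only handles English-name variants (matching a Chinese name to an English one would need a lookup table).

```python
import re

# Assumed lists, not taken from the project; extend as needed.
TITLES = {"hon", "dr", "prof", "mr", "mrs", "ms", "miss", "ir"}
POST_NOMINALS = {"gbm", "gbs", "sbs", "bbs", "mh", "jp", "sc"}
IGNORED = TITLES | POST_NOMINALS


def name_tokens(name):
    """Return the lowercase name tokens with titles and honours removed."""
    # Split on anything that is not a letter, so "Tai-man" -> "tai", "man".
    words = re.findall(r"[^\W\d_]+", name.lower())
    return frozenset(word for word in words if word not in IGNORED)


def same_person(a, b):
    """Naive match: one token set contains the other, with >= 2 shared tokens."""
    tokens_a, tokens_b = name_tokens(a), name_tokens(b)
    shared = tokens_a & tokens_b
    return len(shared) >= 2 and (tokens_a <= tokens_b or tokens_b <= tokens_a)
```

With placeholder names, same_person("Hon CHAN Tai-man, JP", "Mr Chan Tai-man") matches, while same_person("Hon CHAN Tai-man, JP", "Dr WONG Siu-ming") does not. Requiring two shared tokens keeps a shared surname alone from counting as a match.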

There is also the ability to override the results of a parse with user inputs. This model is in raw.models.parsed.Override.
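
Conceptually, an override layers user-supplied values on top of the parsed ones. The following sketch shows that idea with plain dictionaries; the function name and the rule that None means "no override" are assumptions, and the actual Override model may store and apply its data differently.

```python
def apply_overrides(parsed, overrides):
    """Return a copy of `parsed` where user-supplied values take precedence.

    Fields whose override value is None are left as parsed.
    """
    result = dict(parsed)
    result.update({key: value for key, value in overrides.items() if value is not None})
    return result
```

The parsed input is left untouched, so the original parse result stays available for comparison.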

Viewing the results

There is a basic front end that allows you to view the raw and parsed results of scrapes. Start the Django development server, and you should be able to see an index page that lists some of the models and their data.

Current status

I am working on parsing LegCo's Hansard records.
