A scraper for a KT Web interface for documents

Tested on the City of Tampere's KT Web site.

Install

sudo apt-get install python-pip python git abiword tesseract-ocr tesseract-ocr-fin wv ghostscript python-imaging python-dev libxml2-dev libxslt1-dev zlib1g-dev libjpeg62 libjpeg62-dev
git clone https://github.com/jensfinnas/ktweb-scraper
cd ktweb-scraper
pip install -r requirements.txt

You will also need to put your AWS credentials in ~/.aws, as described at https://aws.amazon.com/developers/getting-started/python/
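
Before the first run it can help to confirm that the credentials file is where the AWS tooling expects it. The following is a minimal sketch, not part of the scraper: the function name find_aws_credentials is hypothetical, and it assumes the conventional ~/.aws/credentials location rather than environment variables or another configuration source.

```python
import os


def find_aws_credentials(home=None):
    """Return the path to the shared AWS credentials file, or None if missing."""
    # Assumption: credentials live in the conventional ~/.aws/credentials file.
    home = home or os.path.expanduser("~")
    path = os.path.join(home, ".aws", "credentials")
    return path if os.path.isfile(path) else None
```

Calling find_aws_credentials() with no arguments checks the current user's home directory. A None result means the file still needs to be created.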

Command line usage

To start scraping:

python run.py

To get help:

python run.py --help

Using the scraper as a Python module

Basic initialization.

from modules.site import Site

site = Site("http://ktweb.tampere.fi/ktwebbin/dbisa.dll/ktwebscr/")

Get a list of all available decision-making bodies.

print site.bodies()

Get a list of upcoming meetings, past meetings, or both for a given body.

print site.upcoming_meetings("Kaupunginhallitus")
print site.past_meetings("Kaupunginhallitus")
print site.meetings("Kaupunginhallitus")
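
The calls above can be combined to get an overview of a whole site. This is a rough sketch that relies only on site.bodies() and site.meetings() as described here. The helper name count_meetings_per_body is hypothetical, and the sketch assumes that each item returned by bodies() can be passed straight to meetings().

```python
def count_meetings_per_body(site):
    """Map each decision-making body to its number of meetings."""
    counts = {}
    for body in site.bodies():
        # Assumption: each item from bodies() is accepted by meetings().
        counts[body] = sum(1 for _ in site.meetings(body))
    return counts
```

Because this issues one request per body, it may be slow on sites with many bodies.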

You can also choose to only get meetings after a specific date.

print site.meetings("Kaupunginhallitus", after_date="2016-06-01")
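
If the cutoff should be relative to today rather than a fixed date, the date string can be built with the standard library. This is a sketch under assumptions: recent_meetings is a hypothetical helper, and after_date is assumed to take a "YYYY-MM-DD" string, as in the example above.

```python
from datetime import date, timedelta


def recent_meetings(site, body, days=30, today=None):
    """Return meetings of `body` held after the date `days` days ago."""
    today = today or date.today()
    cutoff = today - timedelta(days=days)
    # Assumption: after_date takes a "YYYY-MM-DD" string; isoformat() yields that.
    return site.meetings(body, after_date=cutoff.isoformat())
```

The today parameter exists only to make the helper easy to test with a fixed date.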

Meetings have two kinds of documents: agendas ("esityslista") and minutes ("pöytäkirja"). Get them with meeting.agenda() and meeting.minutes(), or get both with meeting.documents().

for meeting in site.meetings("Kaupunginhallitus"):
    for doc in meeting.documents():
        print doc

Documents can also be downloaded.

doc.download()

By default, documents are downloaded to a tmp folder with an autogenerated file name. Override these defaults with:

doc.download(file_name="my_file.pdf", folder="myfolder")
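
Putting the pieces together, every document from one body can be saved into a single folder. The sketch below uses only the calls shown in this README. The helper name download_all, the file naming scheme, and the default "pdf" extension are assumptions for illustration, and real documents may come in other formats.

```python
def download_all(site, body, folder, extension="pdf"):
    """Download every document of every meeting of `body` into `folder`."""
    saved = []
    for m, meeting in enumerate(site.meetings(body)):
        for d, doc in enumerate(meeting.documents()):
            # Hypothetical naming scheme: meeting index plus document index.
            file_name = "meeting%03d_doc%d.%s" % (m, d, extension)
            doc.download(file_name=file_name, folder=folder)
            saved.append(file_name)
    return saved
```

The function returns the file names it requested, which makes it easy to log or post-process the downloads afterwards.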
