Scraper code for TFL website

TODO

  1. start doing summaries of trains and look for patterns
  2. lots of general data exploration, need to understand the dataset
  3. Ways to view, plot, graph, visualize the data. (see below)
  4. Start thinking about prediction problems?
  5. write a doc of my own describing data sets
  6. set up script for deployment to server (may try out virtualization and docker for fun ;-)
  7. monitoring and stats -- scraper should dump size and # of files into rrd on every run (see the rrdtool sketch after this list). Parser should do the same for records. Metadata on storage locations too.
  8. Monitor size of production hdf5 file
  9. add indices to hdf5 file when I understand the data better
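
A possible shape for TODO 7, assuming the rrdtool CLI is installed; the DS/RRA layout below is a guess at sensible defaults, not a spec:

```python
import subprocess

def init_scraper_rrd(path="scraper.rrd"):
    # DS/RRA layout is a guess at sensible defaults, not a spec.
    subprocess.run([
        "rrdtool", "create", path, "--step", "300",  # one slot per 5 minutes
        "DS:files:GAUGE:600:0:U",                    # files fetched this run
        "DS:bytes:GAUGE:600:0:U",                    # bytes fetched this run
        "RRA:AVERAGE:0.5:1:2016",                    # a week of 5-minute points
    ], check=True)

def record_run(n_files, n_bytes, path="scraper.rrd"):
    subprocess.run(["rrdtool", "update", path, f"N:{n_files}:{n_bytes}"],
                   check=True)
```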

NOTES

So far I am able to collect the detailed and summary predictions. Everything is going into S3 and a local dir right now. There is a slight concern about data size as we collect: I should end up collecting at a rate of >100GB/year ($10/month), though there is a chance that I can glacier most of that after analysis ($1/month). The xml flattening stuff is very easy but relatively time consuming. The number of rows gets enormous quickly and there is a ton of redundant info. I have to start coming up with ideas to shrink the data set into usable bits. The usable stuff should get stuck somewhere more database-y. What is suitable? Mongo, HDF5, ???
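
To make the flattening concrete, here is a minimal sketch assuming a TrackerNet-style nesting of station/platform/train elements; the tag and attribute names (`S`, `P`, `T`, `SecondsTo`, `TrackCode`, etc.) are illustrative stand-ins, not the exact feed schema:

```python
import xml.etree.ElementTree as ET
import pandas as pd

def flatten_predictions(xml_path):
    # One output row per (station, platform, predicted train).
    rows = []
    root = ET.parse(xml_path).getroot()
    for station in root.iter("S"):            # station elements
        for platform in station.iter("P"):    # platforms at that station
            for train in platform.iter("T"):  # predicted trains
                rows.append({
                    "station": station.get("N"),
                    "platform": platform.get("N"),
                    "set_no": train.get("SetNo"),
                    "seconds_to": int(train.get("SecondsTo", "0")),
                    "track_code": train.get("TrackCode"),
                })
    return pd.DataFrame(rows)
```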

The goal for the secondary storage is to find something that is compact and cheap but can still be properly queried. We first have to define what the filtered and compacted records look like, then we have to store them. Mongo is quite expensive for this; HDF5 is quite restrictive. Perhaps one of the newer ones (RethinkDB, for example).
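
For the compact-but-queryable requirement, one option is a pandas HDFStore in table format: compressed on disk, but with indexed queries on chosen columns. A sketch, with a made-up frame standing in for the flattened records:

```python
import pandas as pd

# Stand-in for the flattened records above.
df = pd.DataFrame({
    "station": ["OXC", "OXC", "VIC"],
    "set_no": ["201", "202", "115"],
    "seconds_to": [30, 180, 95],
})

# Table-format HDF5: compressed, appendable, and queryable on data_columns.
with pd.HDFStore("tfl_compact.h5", complevel=9, complib="blosc") as store:
    store.append("predictions", df, data_columns=["station", "set_no"])
    # Pull back one station without reading the whole file.
    subset = store.select("predictions", where='station == "OXC"')
```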

Next steps are to increase the information ratio of the records. Currently we can derive the current position of a train and the time to each station on the line from each record. So every 30 seconds we know how long it is to each station and how far the train has moved. We would like to find out how long it takes a train to move between stations (sections of track). We should also be able to use the time to each station to look ahead and spot trouble in front of that train. We would also like to find out how long trains stay in a station, and possibly whether trains are congested or not. Being able to recognize congestion, minor delays, and major delays is important. I think I should be able to get a lot of info out of the track section field if it increases linearly along the line; technically that could be related to a distance.
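
A sketch of the between-stations idea, assuming each 30-second snapshot row carries a timestamp, a train identity (`set_no` here), and a `track_code`; the real feed would need a proper train identity key:

```python
import pandas as pd

def section_times(snapshots: pd.DataFrame) -> pd.DataFrame:
    # snapshots: one row per train per 30s poll, with columns
    # timestamp (datetime64), set_no, track_code.
    snapshots = snapshots.sort_values(["set_no", "timestamp"])
    prev_code = snapshots.groupby("set_no")["track_code"].shift()
    moved = snapshots["track_code"] != prev_code   # first sighting in a section
    entries = snapshots[moved].copy()
    # Gap between successive section entries = time spent in the previous section.
    entries["seconds_in_prev_section"] = (
        entries.groupby("set_no")["timestamp"].diff().dt.total_seconds()
    )
    return entries
```

Note the resolution is bounded by the 30 second poll, which feeds directly into the sampling question under THOUGHTS below.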

Would like to get the track sections for each station. That could be useful.

We also need to store intermediate results and stats in a db. This will be used to run the website. Here it makes more sense to use a mongodb-type system, as that is mostly unstructured text that will be shown repeatedly.
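
A sketch of that, assuming a local mongod; the database, collection, and field names are all invented for illustration:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
stats = client["tfl_track"]["line_stats"]   # names invented for illustration

# Upsert one precomputed stat the website can read back cheaply.
stats.update_one(
    {"line": "victoria", "date": "2014-01-01"},
    {"$set": {"avg_section_seconds": 92.4,   # placeholder value
              "updated_at": datetime.now(timezone.utc)}},
    upsert=True,
)
```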

Would like to build a few standard visuals and batch analysis jobs. Should plan on making as much as possible reusable so that things can be transformed easily into web stuff.
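
For reuse, one pattern is plotting helpers that take a frame and return a figure, so the same code backs batch reports now and web views later. Illustrative only, building on the hypothetical section_times() frame above:

```python
import matplotlib.pyplot as plt

def plot_section_times(entries, title="Section travel times"):
    # `entries` is the frame produced by section_times() above.
    fig, ax = plt.subplots()
    entries["seconds_in_prev_section"].dropna().hist(bins=50, ax=ax)
    ax.set_xlabel("seconds per track section")
    ax.set_ylabel("observations")
    ax.set_title(title)
    return fig  # caller decides: fig.savefig(...) in batch, embed on the web
```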

Plan for a webpage and mobile apps?

ARGH! Lots to do!

IDEAS FOR TRACKING

  • time each step
  • count number of xml files fetched and their size
  • count number of rows per table added
  • keep track of size of hdf5 object
  • total size of s3 buckets
  • total count of xml files
  • total row count
  • intelligent and parseable/collectable logs (stdout so docker sees them? see the sketch after this list)
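
A sketch of the logging/metrics idea: emit one JSON record per event on stdout, so `docker logs` collects them and they stay machine-parseable. Field names are made up:

```python
import json
import sys
import time
from contextlib import contextmanager

def metric(name, **fields):
    # One JSON object per line on stdout: easy to collect and parse.
    print(json.dumps({"ts": time.time(), "metric": name, **fields}),
          file=sys.stdout, flush=True)

@contextmanager
def timed(step):
    # Times a step and emits its duration as a metric.
    start = time.time()
    yield
    metric("step_seconds", step=step, seconds=round(time.time() - start, 3))

# Usage:
#   with timed("fetch_xml"):
#       ...fetch and store...
#   metric("xml_files_fetched", count=12, size_bytes=48213)
```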

THOUGHTS

  • Is track id linear?
  • how is velocity/time between stations most accurately defined?
  • how do I remove the variability caused by the 30 second sample frequency? (see the midpoint sketch after this list)
  • meaningful metrics for train speed/times/disruptions etc
  • correlate track closures to train speeds using other feed
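
On the 30 second sampling question: the true section change happens somewhere between the last snapshot in the old section and the first snapshot in the new one, so taking the midpoint bounds the worst-case error at 15 seconds instead of 30, and averaging over many crossings shrinks it further. A trivial sketch:

```python
from datetime import datetime

def midpoint_crossing(last_seen_old: datetime, first_seen_new: datetime) -> datetime:
    # Best guess for when the train actually crossed into the new section.
    return last_seen_old + (first_seen_new - last_seen_old) / 2
```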
