===============================
===    gps2gtfs README      ===
===============================

A suite of Python and SQL tools for matching bus tracking data to GTFS data.

=== License ===

gps2gtfs is released under the MIT license.
See LICENSE in this directory.


=== About ===

At the moment this is a more-or-less scattered collection of tools which you
can use to match GPS tracking data of public transportation vehicles with their
GTFS schedule counterparts. 

The instructions below give a straightforward step-by-step process which
will accomplish this, but as yet there is no single coherent interface by
which this is done.


=== Instructions ===

0. Prerequisites

You will need:
  - a SQL database in which you can make tables and such
  - Python
  - a Python module for accessing your database (psycopg2 recommended for PostgreSQL)
  - the gtfs_SQL_importer tool, available at http://cbick.github.com/gtfs_SQL_importer
  - GTFS data for your agency imported into the database using the above
  - This software

If you are missing any of the first 3:

  - I recommend PostgreSQL. It can be a pain to set up but handles very nicely.
  - Get Python from www.python.org. I used version 2.5.x when writing this.
  - You can find Python modules for your database at:
        http://wiki.python.org/moin/DatabaseInterfaces
    If you have PostgreSQL I recommend you get psycopg2:
        http://initd.org/pub/software/psycopg/



1. Set up python database access

If you're using postgres or another socket-communicating db server:

  In the config folder there is a file called db.config.example. Make a copy
  of this file called db.config and edit it appropriately. 

If you're using anything other than postgres:

  Modify common/src/dbutils.py to handle your DBMS (shouldn't be difficult).
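
For a sense of what this involves, here is a minimal sketch of a connection
helper along the lines of what dbutils.py sets up, assuming psycopg2. The
config section and key names below are made up for illustration; use the
ones you find in db.config.example:

  import psycopg2
  from ConfigParser import ConfigParser  # Python 2.x, matching the version above

  def get_connection(config_path="config/db.config"):
      # Section and key names are hypothetical -- mirror db.config.example.
      cfg = ConfigParser()
      cfg.read(config_path)
      return psycopg2.connect(host=cfg.get("database", "host"),
                              database=cfg.get("database", "dbname"),
                              user=cfg.get("database", "user"),
                              password=cfg.get("database", "password"))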



2. Create the schema for bus tracking data.

From the core/sql folder, apply the make_tables.sql file to your database.
Using psql as an example, you would run the command:

  psql mydbname myusername -f make_tables.sql



3. Pull the bus tracking data into the DB.

It is up to you to figure out how to do this for your particular
data source. From the perspective of Python, the goal is to get
your data points into the format requested by the update_routes()
method inside the core/sql/dbqueries.py file. Alternatively, you can
fit your data directly into the db schema, in particular into the
vehicle_track table that was created in step 2.
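
If you go the direct route, a hypothetical sketch of the loading step,
reusing the get_connection() helper sketched in step 1, might look like the
following. The column names other than dirtag and routetag are made up for
illustration; check make_tables.sql for the real vehicle_track schema:

  rows = [...]  # e.g. (vehicle_id, lat, lon, report_time, dirtag, routetag) tuples
  conn = get_connection()
  cur = conn.cursor()
  # The column list is illustrative -- match it to the schema from step 2.
  cur.executemany("insert into vehicle_track "
                  "(vehicle_id, lat, lon, report_time, dirtag, routetag) "
                  "values (%s, %s, %s, %s, %s, %s)",
                  rows)
  conn.commit()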

To grab San Francisco Muni tracking data you can just use the code
that is already provided in the nextmuni_import directory. This is
set up to pull the data directly from the nextmuni xml service.
To use it, follow these steps:

  3.1. Fix erroneous SFMuni GTFS data:
       
       From the nextmuni_import/sql directory, apply the change_sf_gtfs.sql
       file.

  3.2. Insert correct routeid/dirtag matchup data:

       From the nextmuni_import/sql directory, apply the populate_rid_dirtag.sql
       file.

  3.3. Set up a job to collect the data:

       Essentially, you want the Python script at
       nextmuni_import/src/route_scraper.py to run every minute
       (or however often you choose), giving it "ALL" as an
       argument:

       python route_scraper.py ALL
       
       The easiest way to get this to run every minute is to set up
       a cron job, though any method of running it on a regular
       schedule will do.
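
       For example, a crontab entry along these lines (the path is
       illustrative) would run the scraper once a minute:

       * * * * * cd /path/to/gps2gtfs/nextmuni_import/src && python route_scraper.py ALL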

       *** Before you choose how to provide data to the db, see
       *** step 4.2 below.



4. Do some pre-matchup setup.

Now that there is tracking data in your database, you probably want to
match that to the GTFS schedule, since you are using this package. To
do this you first need to do some preprocessing.

4.1. From the core/src directory, run the following:

  python Preprocessing.py trips

This separates the tracking data into distinct routes and trips.

4.2. Then, you have two options:

  a) Populate the routeid_dirtag table manually. (This was already done
     in step 3 if you imported data from San Francisco Muni.) The idea
     is to match NextBus' "dirtag" values up with GTFS route IDs.
     The reason to do this manually is that both NextBus and GTFS data
     are riddled with errors that you may or may not have caught yet.
     (A sketch of a manual insert follows this list.)

     *** If you aren't using a NextBus data source, then you must choose 
     *** what to do depending on what you provided for the dirtag field in
     *** step 3.

  b) Try to automatically populate the routeid_dirtag table. This will not
     always work since, as noted in (a), there are lots of data errors over
     which I have no control and which I have no way to detect.
     You can do this as follows (from the core/src directory):

     python Preprocessing.py populate_ridtags

     Before you do this, though, you might want to check the results
     of the following SQL queries:

     select count(*) as cnt, route_short_name from gtf_routes 
       group by route_short_name having count(*) > 1;

     select count(distinct routetag) from vehicle_track 
       where dirtag != 'null' and routetag != 'null'
       group by dirtag having count(distinct routetag) > 1;
     
     If both queries return 0 rows then you are in good shape. If they
     return a lot of rows then you are not in such good shape, and should
     give serious consideration to (a). The data I currently have for
     SF Muni returns 121 rows for the second query.
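
Here is the manual-insert sketch promised in (a), reusing the
get_connection() helper from step 1. The column names and example values
are hypothetical; check make_tables.sql for the real routeid_dirtag schema:

  # Map each NextBus dirtag to its GTFS route ID. Values are made up.
  pairs = [("101_OB1", "route_101"), ("101_IB1", "route_101")]
  conn = get_connection()
  cur = conn.cursor()
  cur.executemany("insert into routeid_dirtag (route_id, dirtag) "
                  "values (%s, %s)", pairs)
  conn.commit()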


5. Match GPS data to GTFS schedules (hoorah!)

** NOTE: You might want to selectively apply some of the indexes found in
** the core/sql/make_table_indexes.sql file before doing the following.

From the core/src directory, run:

  python Preprocessing.py match_trips

You have now successfully matched the trips from 4.1 to GTFS scheduled trips.
Now, run:

  python Preprocessing.py gps_schedules

This populates the gps_stop_times table (compare to gtf_stop_times) with the
actual schedule that each GPS trip followed, i.e. when each bus actually
arrived at each stop.
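
As a rough example of what you can now do with that table, a query along the
following lines would summarize lateness per stop. The column names here are
guesses for illustration; check the actual gps_stop_times schema:

  conn = get_connection()
  cur = conn.cursor()
  # Average (actual - scheduled) arrival time per stop; columns are hypothetical.
  cur.execute("select gps.stop_id, "
              "       avg(gps.arrival_time - gtf.arrival_time) as avg_lateness "
              "from gps_stop_times gps join gtf_stop_times gtf "
              "  on gtf.trip_id = gps.trip_id and gtf.stop_id = gps.stop_id "
              "group by gps.stop_id")
  for row in cur.fetchall():
      print row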



6. Make some corrections

Some situations can result in erroneous assignments after following step 5. 
To fix these, run 

  python Preprocessing.py fix_earlybirds

This searches for better matches for routes which are abnormally off-schedule.

*** You can visually test the gtfs<->gps matchings using the OSMViz package,
*** available at http://cbick.github.com/osmviz .
*** Once you have OSMViz, go to the core/src/ directory and look at 
*** the visualizer.py file. By editing the query at the end of the file
*** (inside the "if __name__ == '__main__'" clause), you can choose what
*** to visualize. Run "python visualizer.py" and enjoy the show.


7. Get ready to analyze

From the core/sql folder, you will want to apply the make_table_indexes.sql
file to your database. This will make all future analyses much much faster.
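
Using psql as an example, as in step 2:

  psql mydbname myusername -f make_table_indexes.sql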



8. Analyze

There is some postprocessing/analysis code in the postprocessing directory.
The postprocessing/sql directory holds code for creating a summary table of
bus lateness data, as well as various queries that were useful to me when
exploring the data. The postprocessing/src directory holds Stats.py, which
performs a variety of statistical analyses and produces plots of the data;
it requires the SciPy and Matplotlib libraries.

