dbscan_gadm

Fun with DBSCAN algorithm and GADM-geocoded points of interest.

This reads in a data set of coordinates (latitude and longitude) along with geocoded Global Administrative Area features for these, then performs unsupervised learning to cluster these into zones of interest based on geographic features using the Density-Based Spatial Clustering of Applications with Noise algorithm with a customized distance function.

Usage

Update options in settings.py and run from the command line:

$ python dbscan_gadm_metric.py

Custom Distance Metric Modes

You can use one of three modes to calculation the distance between points for DBSCAN clustering

Vincenty Basic

Custom distance metric using Vincenty's Formula.

Set MODE = 'vincenty-basic' in settings.py.

Vincenty+GADM Features

Custom distance metric that combines Vincenty's Formula with GADM features to calculate a scored distance (in km). The metric starts with a base Vincenty's Forumal distance calculation, then modifies this based on whether the two points are in the same city and or city neighborhood.

This is just one illustrative method of using GADM features to modify distance. It is "magic numbery" for simplicity. In real-life one would derive values for GADM feature weights -- or use the full proxy method.

Set MODE = 'vincenty-gadm' in settings.py.

Proxy Distance Calculation

Custom distance metric that uses a simple proxy ID to fetch attributes from an external data set (for illustrative simplicity in this case, we use the passed POI dataset)

While this Proxy approach replicates the same distance formula of Vincenty-plus-GADM it could be modified to support ANY distance formula. For example, rather that using GADM features one could instead extract a key or GUID used to look up a whole array of features used for a custom distance calculation (even to make a REST call to a route management system to get true driving times between each X and Y).

Set MODE = 'proxy' in settings.py.

Sample Results

A sample data file of 10 points of interest is provided in the /sample-data folder to illustrate the results of different DBSCAN distance calculations. These points were selected to show different clustering results based on whether or not GADM features are used.

Here is a map of all 10 points:

Vincenty Basic

When looking at the 10 points from a simple human "how would you cluster these?" perspective, it appears we have five clusters. This is exactly how the the Vincenty Basic distance formula clusters these (when set to allow clusters of 1+ points and to treat anything within a 1.0 km radius as candidates in the same cluster:

Model Performance and Metrics
================================================================================
Estimated number of clusters: 5
Homogeneity: 0.676
Completeness: 1.000
V-measure: 0.807
Adjusted Rand Index: 0.000
Adjusted Mutual Information: 0.000
Silhouette Coefficient: 0.930

You can see the raw results here. A matplotlib view shows a Mercator projection of five separate clusters:

This appears to "make sense" from a common common sense POV.

Adding GADM Features

However, lets see what happens if we use Vincenty distance modified by GADM features to determine clusters. In this case we have applied two rules based on a very simplistic urban model:

Urban travel is harder than non-urban. If two points are in the same city, and the city is large enought to have distinct GADM neighborhoods (a.k.a. localities), then provide a 100% distance penality.
However, assume that intercity neighborhoods are defined based on natural human clusters. As such, if two points are in the same city AND are in the same neighborhood locality, then give a 20% distance bonus.

Applying these rules yields 7 clusters (vs. 5):

Model Performance and Metrics
================================================================================
Estimated number of clusters: 7
Homogeneity: 0.819
Completeness: 1.000
V-measure: 0.901
Adjusted Rand Index: 0.000
Adjusted Mutual Information: 0.000
Silhouette Coefficient: 0.467

You can see the raw results here. Here is the matplotlib Mercator view:

This view shows two places where simple Vincenty clusters were divided: Arlington, VA and central Washington, DC:

Let's look at Arlington

The two POIs in Arlington are within 1 km of each other but in two different neighborhoods (Virginia Square and Waverly Hills). These neighborhoods are separated by Route I-66: Adding the modified GADM rule allowd DBSCAN to "recognize" this using neighborhood as a human proxy input.

Let's look at Washington, DC

Adding GADM features split the three Washington POIs into two clusters (even though all were within 1 km of each other). One cluster was to the west of The White House (in Northwest Washington). The other was north (in Downtown): Anyone who has driven around DC will tell you this makes sense also, as the The White House creates lots of detours.

Intra-neighborhood Traffic

As expected, using GADM features kept points that are close together AND in the same intercity neighborhood together. Here is Old Town, Alexandria:

Options

Currently the program defines options in a settings.py file:

Program Settings

Setting	Description	Example Values
`INPUT_FILE`	Source data file of coordinates	See below
`OUTPUT_FILE`	Output data file name	Qualified filename with path
`ZOA_SUMMARY_TO_SCREEN`	Print ZOA summary to screen	`True`, `False`
`MATPLOT_ZOA_CLUSTERS`	Use `matplotlib` to graph clusters	`True`, `False`
`MODE`	Custom distance metric formula to use. See below.	See above
`DEFAULT_RADIUS`	Default ZOA radius for DBSCAN epsilon, in km	`1.0`
`DEFAULT_ROUNDING`	Default rounding in decimal places, for GPS coordinates.	`4`
`LOCAL`*	Adjustment factor for coordinates in same neighborhood	`0.8`
`X_TOWN`*	Adjustment factor for coordinates in same city	`2.0`

*These are settings are ignored in vincenty-basic mode. See Sample Results above for discussion of use of LOCAL and X_TOWN settings.

CSV File Settings

The script can take in any CSV file of POIs, as long as the file contains the requisite data points. A sample is provided in the /sample-data folder.

If you use your own file, simply set column names in the dictionary KEY values in the settings.py file as follows

Key	CSV Column name of	Null Allowed in CSV?
`LAT_KEY`	Latitude (in decimal degrees)	NO
`LNG_KEY`	Longitude (in decimal degrees)	NO
`ADDR_KEY`	Single-line Address (e.g., "77 Massachussetts Avenue, Cambridge, MA 02139")	Yes
`NBHD_KEY`	GADM Neighborhood Name	Yes for `vincenty-basic`
`CITY_KEY`	GADM City Name	Yes
`NAME_KEY`	POI Location Name	NO
`ZOA_KEY`*	ZOA label you wish for output	N/A

*Not part of INPUT_FILE. Used to create OUTPUT_FILE.

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
sample-data		sample-data
sample-results		sample-results
tools		tools
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
dbscan_gadm_metric.py		dbscan_gadm_metric.py
gadm.py		gadm.py
requirements.txt		requirements.txt
settings.py		settings.py
utils.py		utils.py

License

JimHaughwout/GADM_DBSCAN

Folders and files

Latest commit

History

Repository files navigation

dbscan_gadm

Usage

Custom Distance Metric Modes

Vincenty Basic

Vincenty+GADM Features

Proxy Distance Calculation

Sample Results

Vincenty Basic

Adding GADM Features

Let's look at Arlington

Let's look at Washington, DC

Intra-neighborhood Traffic

Options

Program Settings

CSV File Settings

About

Resources

License

Stars

Watchers

Forks

Languages