GitHub - jenniferlhood/pwqmn: A utility for data mining the Ontario Provincial Water Quality Monitoring Network (pwqmn) data


============================================================
the Provincial Water Quality Monitoring Network Data Utility
============================================================


v 0.1 (beta)
By Jennifer L A Hood, PhD
created April 30, 2015



Description:
===============


The dataset:
---------------


This program is intended for exploring the Ontario Provincial Water Quality Monitoring Network (PWQMN) data, which is generated by the Ontario Ministry of Environment (MOE, the Ontario Ministry of Environment and Climate Change). The PWQMN data is generated by a program involving the MOE and the 36 conservation authorities of Ontario, and is publicly available online (http://www.ontario.ca/data/provincial-stream-water-quality-monitoring-network), governed by an open licence (http://www.ontario.ca/government/open-government-licence-ontario).

The PWQMN dataset currently contains roughly 100 chemical, biological and physical parameters measured at roughly 2000 stations on roughly 800 rivers and streams throughout Ontario. The record begins in the mid 1960s, though only data since 2000 is available to the public via the MOE website. The MOE has a mandate to monitor and publish data relevant to the determination of many aspects of water quality, river and stream ecosystem status, and the general "health" of all Ontario ecosystems. As far as environmental datasets are concerned, the PWQMN dataset is large and comprehensive: it covers a much larger spatial and temporal scale, and includes many more parameters, than a typical environmental study. Thus, it offers opportunities for data mining and exploration to discover patterns that may not have been previously recognized or that are too small to otherwise detect.

The data is published near-annually, as it becomes available, in Microsoft Excel spreadsheets (and Microsoft Access databases). Because of the format of the row entries, data exploration cannot occur without some manipulation of the raw data, and extracting very specific pieces of information is time-consuming and prone to errors. I have therefore developed a python tool that provides easy access to the PWQMN dataset for extracting some basic and specific information useful in data exploration and data mining. Future updates to this program will build on these basics and increase the statistical and graphical tools available to users.




The program:
---------------

Currently, the program is a python script that interfaces with R. It was developed in python 2.7.6 and R 3.0.2 on Ubuntu 14.04.2 (trusty). Anyone wishing to use this program must have the following:


Dependencies:

- Python
- The python library: rpy2 (http://rpy.sourceforge.net/)
- R (statistical programming language)
- The R library: ggplot2 (http://ggplot2.org/)
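
Before launching the program, it can help to confirm that the dependencies above are visible from Python. The following is only a sketch and is not part of the program: it assumes Python 3 (the program itself was developed in Python 2.7.6), and the function name check_dependencies is made up for illustration. It does not check for ggplot2, which has to be verified from within R (for example by running library(ggplot2)).

```python
import importlib.util
import shutil


def check_dependencies():
    """Return a dict mapping each dependency name to True if it appears available."""
    return {
        # rpy2 is the Python package that bridges Python and R.
        "rpy2": importlib.util.find_spec("rpy2") is not None,
        # The R interpreter is looked up on the PATH by its usual executable name.
        "R": shutil.which("R") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print("{:<6}{}".format(name, "found" if found else "MISSING"))
```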


Data:

- PWQMN files converted to .csv (the program does not read .xls files)
- PWQMN station file (includes station metadata)


The program has not yet been tested with other versions of python and R, or on other operating systems. In future updates of this program I will try to reduce or remove these dependencies. Some test data is provided with the program; however, users are encouraged to visit the MOE website to obtain the current data files (http://www.ontario.ca/data/provincial-stream-water-quality-monitoring-network).

The program and supporting files are the following:

pwqmn.py		to launch the program
pwqmn_prog.py		the command line program providing user interaction
UI.py			the current command line display
pwqmn_data.py		contains the data class files.
cities.txt		a list of cities used by the program.
config.txt		specifies the location of the data files. 
			-Must be updated with path to data




How To Use
=====================

After all dependencies are installed, launch the program by running pwqmn.py (type "python pwqmn.py" in the console). The program reads the config file and attempts to load the data from the specified locations. The file locations must be provided in the format specified in the config file.
If no station metadata file or station data files are found, the program quits, as there is nothing to do. The program loads all of the specified data, so some systems may not perform well if all data is loaded.

The user generates a selection from among the loaded data to produce summary stats and graphs.

To see the available commands, use the <cmd> command by typing: cmd
Note that currently the program is case sensitive.

The rest of this guide will follow an example workflow to generate some output.

To get started, two main selection types are required: some stations must be chosen and some parameters must be chosen. These selections do not need to be made in any particular order.

To select some stations, use the <sel> ("select") command, followed by selection options. Currently users can select stations by the river they are situated on, and by their proximity to a city (one of the cities provided in the cities.txt file).

Here is the description of the sel command found inside the program:

---------------------------------------------------------
sel
Use: The select command. Select a subset of stations
     from among all availbale stations. Stats and graphs are
     produced on selected sites.

   -river <river> 	 select by river
   -add <river> 	 append river selection
   -rem <river> 	 remove river from  selection
   -rem zero 		 remove stations with zero observations
   -rem x 		 remove stations with less than x observations
   -city <city> <x> 	 all sites x kms from city
---------------------------------------------------------

To select the stations within 50 km of Toronto, type the following: sel -city Toronto 50
To select stations on a river, type the following: sel -river Grand River
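
To illustrate what a proximity selection such as "sel -city Toronto 50" involves, here is a minimal sketch that filters stations by great-circle (haversine) distance from a city. This is an assumption made for illustration: how the program itself measures distance is not described in this guide. The station IDs and coordinates are made up, the function names are hypothetical, and the Toronto coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stations_near(stations, city_lat, city_lon, max_km):
    """Return the IDs of stations no farther than max_km from the city."""
    return [
        station_id
        for station_id, (lat, lon) in stations.items()
        if haversine_km(city_lat, city_lon, lat, lon) <= max_km
    ]


# Hypothetical station IDs and coordinates, for illustration only.
stations = {
    "STN_A": (43.70, -79.40),  # a few km from downtown Toronto
    "STN_B": (43.45, -80.49),  # far to the west, well beyond 50 km
}
toronto = (43.6532, -79.3832)  # approximate downtown Toronto coordinates
print(stations_near(stations, toronto[0], toronto[1], 50))
```

With these made-up coordinates, only the first station falls inside the 50 km radius.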

After the select command, the program displays your current station selection. You can add to the selection or remove stations from it. Currently, only adding stations by their respective rivers is supported; making another city selection replaces the previous one. Future versions will expand the selection function.
To add a river to the current selection, type: sel -add river Speed River

To see which rivers and cities are available for selection, you can use the <ls> ("list") command. The current description of this command given in the program is as follows:

---------------------------------------------------------
ls
Use: list some aspects of the dataset
     the following options are available:

   -as 		 to list all available stations
   -ss 		 to list selected stations
   -ap 		 to list all available parameters
   -sp 		 to list the selected parameters
   -ar 		 to list all rivers
   -sr 		 to list rivers of selected stations
   -ac 		 to list all available cities
---------------------------------------------------------

So, to see the available cities, type: ls -ac
To see the available rivers, type: ls -ar
To see the rivers in the current selection, type: ls -sr


Secondly, some parameters must be selected. To see the available list of parameters, type: ls -ap
Currently the parameter descriptions are not displayed (this will be changed in an update); however, they are available in the data files themselves.
To add a parameter to the selection, use the <sparm> ("select parameters") command. The following is the description of this command given by the program:

---------------------------------------------------------
sparm
Use: The select parameters command. Select a subset of
     parameters from among all available parameters. Stats
     and graphs are produced on selected parameters.
	  the following options are available: 

   -top x 		 top x most observed parameters
   -add <param> 	 add parameter to selected
   -add <year> 		 add year to selected years
   -rem <param> 	 remove parameter from selected
   -rem <year> 		 remove year from selected years
   
---------------------------------------------------------

So, to add Total Phosphorus to the selected parameters, type: sparm -add PPUT
(PPUT is the parameter code for Total Phosphorus).
To add total alkalinity to the selection, type: sparm -add ALKT
To add the 5-day biological oxygen demand (BOD5) to the selection, type: sparm -add BOD5
To remove this parameter, type: sparm -rem BOD5

To see which parameters are in the current selection, type: ls -sp
This will give you the list of parameter codes along with the number of observations made for each parameter, either at the currently selected stations or, if none are selected, at all stations.
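
The per-parameter observation counts, and the idea behind "sparm -top x", can be sketched with a simple tally. This is only an illustration of the counting, not the program's own code: the function name top_parameters and the (parameter code, value) record layout are assumptions, and the values are made up.

```python
from collections import Counter


def top_parameters(observations, x):
    """Return the x most frequently observed parameter codes with their counts."""
    counts = Counter(code for code, _value in observations)
    return counts.most_common(x)


# Made-up (parameter code, value) pairs; the values are not real measurements.
observations = [
    ("PPUT", 0.03), ("PPUT", 0.05), ("PPUT", 0.02),
    ("ALKT", 210.0), ("ALKT", 198.0),
    ("BOD5", 1.1),
]
print(top_parameters(observations, 2))  # [('PPUT', 3), ('ALKT', 2)]
```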
 
 
Now that some stations and parameters are selected, graphs and summaries can be produced. Use the <graph> command to produce some graphs. The following graphs and options are currently available:
 
---------------------------------------------------------
graph
Use: produce graphs on selected sites and parameters

   -bar all <filename> 		 bar plot of number of observations
   -bar river <filename> 	 bar plots by river
   -hist <filename> 		 histograms for selected parameters
   -box m <filename>  		 box plots by month
   -box y <filename>  		 box plots by year
   -box r <filename>  		 box plots by river (all dates)
   -scatter <filename> 		 scatter plots for parameters
---------------------------------------------------------

To generate a boxplot separated by the month of observation, type: graph -box m test
In this example, we have two parameters selected (PPUT and ALKT), and all stations on the Grand River and Speed River. The boxplot command will generate one plot per parameter, with one box per month of collection. These plots will be stored in the main directory under the specified file name ("test", in this example), with the parameter and plot type appended to the name. Subsequent calls to the graph -box m command with the same file name will overwrite the existing graphs.
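
The grouping behind a "box per month" plot can be sketched as follows. This is a rough illustration in plain Python, not the program's own code (the program produces its plots through R): the function name group_by_month and the (date, value) record layout are assumptions, and the values are made up.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def group_by_month(records):
    """Map month number (1-12) to the list of values observed in that month."""
    by_month = defaultdict(list)
    for observed_on, value in records:
        by_month[observed_on.month].append(value)
    return dict(by_month)


# Made-up (date, value) records for a single parameter.
records = [
    (date(2010, 1, 12), 0.04),
    (date(2011, 1, 20), 0.06),
    (date(2010, 7, 3), 0.02),
]
for month, values in sorted(group_by_month(records).items()):
    # Each month's list of values is what one box in the plot would summarize.
    print(month, len(values), median(values))
```

Observations from different years fall into the same month group, which is why the January group here holds two values.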

To generate a scatter plot of these two selected parameters, type: graph -scatter test
Scatter plots can only be generated when at least two parameters are selected.


To see some basic statistics on the selected stations and parameters, use the <stat> ("statistics") command. The following is currently supported:


---------------------------------------------------------
stat
Use: produce basic statistics on selected sites and parameters

   -s <filename> 	 produce summary output on selected

---------------------------------------------------------

To generate some output on the current selection, type: stat -s test

This will generate a text file with useful information on the selection, including a log of what was typed to obtain the selection, the number of stations and rivers in the selection, and the number of observations (n), mean, median and standard deviation of each parameter.
If two or more parameters are in the selection, a Pearson's correlation will be conducted, and the summary will include the Pearson's correlation coefficient and P-value for each parameter pair.
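
For readers who want to check the summary values by hand, here is a minimal sketch of the same quantities in plain Python. It is not the program's own code: the function names are hypothetical, the values are made up, and the use of the sample standard deviation is an assumption. The P-value is omitted because it requires a t distribution; if SciPy is installed, scipy.stats.pearsonr returns both the coefficient and the P-value.

```python
import math
from statistics import mean, median, stdev


def summarize(values):
    """Return n, mean, median and sample standard deviation for one parameter."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up paired values standing in for two selected parameters.
pput = [1.0, 2.0, 3.0, 4.0]
alkt = [2.0, 4.0, 6.0, 8.0]
print(summarize(pput))
print(pearson_r(pput, alkt))
```

Because the second made-up series is exactly twice the first, the coefficient comes out at 1 (a perfect positive correlation).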

To end the session, type q

To see what is coming in future versions, see the checklist. To make suggestions, report bugs or to otherwise comment on this program, please contact me at the email address listed on my github profile.