- Date
15 May 2009
- Author
Anand Patil
- Contact
- Web site
github.com/malaria-atlas-project/generic-mbg
- Copyright
Anand Patil, 2009.
- License
GPL, see GPL in this directory.
The generic MBG package allows us to write PyMC probability models for each project that works with some kind of spatial GLM, then turn the model over to the project team for testing, fitting, mapping and experimentation using a few easy shell commands:
mbg-infer
runs the MCMC algorithm using the given model & an input dataset, stored in a csv file, and stores the traces in an HDF5 archive.
mbg-map
takes the HDF5 archive produced by mbg-infer, and a raster with some pixels missing. Produces a set of bespoke summary maps of each predicted quantity that match the raster in terms of grid, missingness pattern and file format.
mbg-3dmap
takes the HDF5 archive produced by mbg-infer, and a raster with some pixels missing. Outputs the full probability density function of each predicted quantity over a thinned version of the raster, and stores it under compression in HDF5 format. This file can be opened and examined graphically using MayaVI.
mbg-validate
takes the HDF5 archive produced by mbg-infer and a 'holdout' dataset, stored in a csv file, and creates a set of predictive samples at the holdout locations and some validation plots.
mbg-decluster
partitions a CSV datafile into 'kept' and 'holdout' portions.
mbg-describe-tracefile
examines an HDF5 archive produced by mbg-infer, and tells you when it was produced, which versions of the code were used, how many iterations it contains and what the input data were.
If the project's members are interested in changing the model or specifying a subjective prior, there are two additional shell commands available to help:
mbg-scalar-priors
draws samples from the prior for all scalar parameters (including deterministics) and plots histograms for inspection.
mbg-realize-prior
draws all scalar parameters from the prior, and realizes and plots the random field on grids matching a number of input ASCIIs.
mbg-describe-tracefile
provides information about the circumstances under which traces were produced.
All shell commands can be run with only the -h option to print some help to the screen. However, if you're reading this document, you don't really need to do that.
For documentation on how to install and administer the package & write specializing modules, run builddocs.py in the docs directory and then see manual.pdf.
If you want to use the shell commands, this section is for you.
mbg-infer module database-file input [options]
Produces the requested database file. Also produces plots of the dynamic traces of all scalar parameters as PDFs, and saves them in the folder name-plots, where name is the name of the database file. You will need to inspect these plots to determine how many 'burnin' iterations should be discarded when making maps.
If you determine that more MCMC samples are needed, simply run mbg-infer with the same database file argument to pick up where you left off and keep sampling.
- The name of the module containing the model specification.
- The name of the database file to be produced. If you do not want it to go in the current directory, specify a path, eg /home/anand/traces/run-01-04-2009. If the database file already exists, you will be prompted about whether you want to continue sampling into it or remove it.
- The name of a csv file containing the input data. If it is in a different directory, specify the path to it, eg /home/anand/data/query-01-04-2009.csv. This csv file must have the following columns:
lon, lat: The coordinates of the observation in decimal degrees.
t: Time in decimal years. This is only required for spatiotemporal models.
All other columns are interpreted as covariates, eg ndvi etc., UNLESS the module implements the non_cov_columns attribute. For example, MBGWorld expects lo_age, up_age columns, pos and neg columns, but does not interpret them as covariates.
-t or --thin: If thin is 10, every 10th MCMC iteration will be stored in the database. Small values are good but slow. 1 is best.
-i or --iter: The number of MCMC iterations to perform. Large values are good but slow.
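The column rules for the input csv can be checked before starting a long run. The following is a minimal sketch, not part of the package: the function name classify_columns and the sample data are hypothetical, and the way a module's non_cov_columns attribute is passed in is an assumption made for illustration.

```python
import csv
import io

# Columns every input csv must have, per the description above.
REQUIRED = ("lon", "lat")


def classify_columns(header, spatiotemporal=False, non_cov_columns=()):
    """Return (required, covariates) for a csv header.

    Hypothetical helper: raises ValueError if a required column is missing,
    and treats every column that is neither required nor listed in
    non_cov_columns as a covariate.
    """
    required = list(REQUIRED) + (["t"] if spatiotemporal else [])
    missing = [c for c in required if c not in header]
    if missing:
        raise ValueError("input csv is missing columns: " + ", ".join(missing))
    reserved = set(required) | set(non_cov_columns)
    covariates = [c for c in header if c not in reserved]
    return required, covariates


# Made-up data, for illustration only.
sample = io.StringIO(
    "lon,lat,t,pos,neg,ndvi\n"
    "36.8,-1.3,2007.5,12,88,0.41\n"
)
header = next(csv.reader(sample))
required, covariates = classify_columns(
    header, spatiotemporal=True, non_cov_columns=("pos", "neg")
)
print(covariates)  # ['ndvi']
```

In this sketch a column named t in a purely spatial dataset would be reported as a covariate, which is probably a sign that the wrong kind of model is being used.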
mbg-describe-tracefile path
If path is a database file, inspects the database file. Prints out the version of the generic package, the module that produced the file and the date the run was started. Writes the input data to csv with filename database-file-input-csv, substituting the actual filename.
If the path is a directory, walks the filesystem starting from the directory, inspecting every database file it finds. Does not produce any csvs.
- The name of the database file or path to be inspected.
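The directory case amounts to a recursive filesystem walk. A rough sketch of that pattern follows; it is not the package's own code. The function name find_tracefiles is hypothetical, and recognising database files by a .hdf5 extension is an assumption (the real command may identify them differently).

```python
import os


def find_tracefiles(root, extensions=(".hdf5",)):
    """Yield paths under root that look like HDF5 database files.

    Hypothetical helper: matching on the file extension is an assumption.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # List candidate database files below the current directory.
    for path in find_tracefiles("."):
        print(path)
```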
mbg-covariate-traces module database-file [options]
Postprocesses the given database file to produce MCMC traces for the covariate coefficients. Produces a directory called database-file-covariate-traces, and populates it with pdf images of the covariate coefficient traces and
- The name of the module containing the model specification.
- The name of the database file containing the MCMC trace.
-t or --thin: If thin is 10, samples of the covariate coefficients will be produced for every 10th MCMC sample. Defaults to 1, meaning no thinning.
-b or --burn: Samples of the covariate coefficients will begin after this many 'burnin' iterations are discarded. Defaults to 0, meaning no burnin.
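The combined effect of burn and thin can be pictured as a slice of the stored trace. This is only an illustrative sketch: select_samples is a hypothetical name, and whether the package counts burnin in stored samples or in raw MCMC iterations is not stated here, so the sketch assumes stored samples.

```python
def select_samples(trace, burn=0, thin=1):
    """Drop the first `burn` stored samples, then keep every `thin`-th one."""
    if burn < 0 or thin < 1:
        raise ValueError("burn must be >= 0 and thin must be >= 1")
    return trace[burn::thin]


# A stand-in trace of 20 stored samples, labelled by their index.
trace = list(range(20))
print(select_samples(trace, burn=5, thin=5))  # [5, 10, 15]
```

With the defaults (burn 0, thin 1) the whole trace is used.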
mbg-decluster input prop [options]
A wrapper for the R function getdeclusteredsample. Produces two new tables, with suffixes HOLDOUT and THINNED, written to the same directory as the input table.
- (string) Path to the input table. It must include columns 'lon' and 'lat'. If a 't' column is also present, the data are treated as space-time. If only a filename is given (no path), the file is assumed to be in the current working directory.
- (float) The proportion of the full data set to be used for the hold-out set.
-m or --minsample: (int) Optional minimum sample size (supersedes prop if larger).
-d or --decluster: (logical) Whether to draw a spatially declustered sample (default) or just a simple random sample.
-p or --makeplot: (logical) Whether to export a pdf map showing the location of the data and the selected points. This is exported to the same directory as the input table.
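The interaction between prop and --minsample can be sketched for the simple random case. This is not the R function getdeclusteredsample and does no spatial declustering; holdout_size and random_split are hypothetical names, and the rounding rule is an assumption.

```python
import random


def holdout_size(n_rows, prop, minsample=0):
    """Hold-out size: prop of the data, or minsample if that is larger."""
    size = max(int(round(prop * n_rows)), minsample)
    return min(size, n_rows)  # cannot hold out more rows than exist


def random_split(rows, prop, minsample=0, seed=0):
    """Simple random split of rows into (kept, holdout)."""
    rng = random.Random(seed)
    k = holdout_size(len(rows), prop, minsample)
    holdout_idx = set(rng.sample(range(len(rows)), k))
    kept = [r for i, r in enumerate(rows) if i not in holdout_idx]
    holdout = [r for i, r in enumerate(rows) if i in holdout_idx]
    return kept, holdout


rows = list(range(100))  # stand-in for 100 data rows
kept, holdout = random_split(rows, prop=0.1, minsample=25)
print(len(kept), len(holdout))  # 75 25
```

Here minsample (25) is larger than 10% of 100 rows, so it takes precedence over prop.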
mbg-map module database-file burn mask [options]
Produces a folder called name-maps, where name is the name of the database file. Puts the requested maps in the folder in format matching the mask. Also produces PDF images of all the requested maps for quick viewing.
- The name of the module containing the model specification.
- The name of the database file (produced by mbg-infer) to be used to generate the maps. If it is not in the current directory, specify a path.
- The number of burnin iterations to discard from the trace before making the maps. You will need to figure this out by inspecting the traces produced by mbg-infer.
- The name of a raster, without extension. The maps will be produced in raster files in the same format, on identical grids, with identical missing pixels. If the file is in a different directory, specify the path to it.
-n or --n-bins: The number of bins to use in the histogram from which quantiles are computed. Large values are good, but use up more system memory. Decrease this if you see memory errors.
-b or --bufsize: The number of buffer pixels to render around the edges of the continents. Set to zero unless the raster-thin option is greater than 1. The buffer will not be very good. In general, if you want a buffer you're better off making your own in ArcView rather than using this option.
-q or --quantiles: A string containing the quantiles you want. For example, '0.25 0.5 0.75' would map the lower and upper quartiles and the median. Default is '0.05 0.25 0.5 0.75 0.95'.
-t or --thin: The factor by which to thin the MCMC trace stored in the database. If you use -t 10, only every 10th stored MCMC iteration will be used to produce the maps. Small values are good but slow. 1 is best.
-i or --iter: The total number of predictive samples to use in generating the maps. Large values are good but slow. Defaults to 20000.
-p or --raster-path: The path to the files containing the covariate rasters. These files' headers must match those of the input raster, and their missing pixels must match those of the input raster also. There must be a file corresponding to every covariate column in input 3 of mbg-infer. For example, if you used rain and ndvi as your column headers, files rain.asc and ndvi.flt and temp.hdf5 should be present in the raster path. Defaults to the current working directory.
-y or --year: If your model is spatiotemporal, you must provide the decimal year at which you want your map produced. For example, Jan 1 2008 would be -y 2008.
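To see why the number of bins limits the precision of the mapped quantiles, here is a rough sketch of reading quantiles off a histogram. It is not the package's implementation: parse_quantiles and histogram_quantiles are hypothetical names, and the fixed [lo, hi] range and the choice to report bin centres are assumptions.

```python
def parse_quantiles(text):
    """Turn a string such as '0.25 0.5 0.75' into a list of floats."""
    return [float(q) for q in text.split()]


def histogram_quantiles(samples, quantiles, n_bins, lo=0.0, hi=1.0):
    """Approximate quantiles of samples from an n_bins histogram on [lo, hi].

    Each quantile is reported as the centre of the first bin at which the
    cumulative count reaches it, so the answer is only resolved to within
    one bin width.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        i = int((x - lo) / width)
        counts[max(0, min(i, n_bins - 1))] += 1  # clamp to the valid bins

    results = []
    for q in quantiles:
        target = q * len(samples)
        cumulative = 0
        for i, c in enumerate(counts):
            cumulative += c
            if cumulative >= target:
                results.append(lo + (i + 0.5) * width)
                break
    return results


# Made-up predictive samples at a single pixel.
samples = [0.05] * 25 + [0.45] * 50 + [0.95] * 25
print(histogram_quantiles(samples, parse_quantiles("0.25 0.5 0.75"), n_bins=10))
```

Only the bin counts need to be stored for each pixel rather than all the samples, which is presumably why memory use grows with the number of bins.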
mbg-3dmap module database-file burn mask [options]
Produces a folder called name-3dmaps, where name is the name of the database file. Puts an HDF5 file in the folder, containing the probability density field of the output of each function in the specializing module's map_postproc list. This data can be examined interactively using MayaVI. File display_3dmap.py, included with the package, provides a template for scene generation.
- The name of the module containing the model specification.
- The name of the database file (produced by mbg-infer) to be used to generate the maps. If it is not in the current directory, specify a path.
- The number of burnin iterations to discard from the trace before making the maps. You will need to figure this out by inspecting the traces produced by mbg-infer.
- The name of a raster, without extension. The maps will be produced in raster files in the same format, on identical grids, with identical missing pixels. If the file is in a different directory, specify the path to it.
-n or --n-bins: The number of bins to use in the histogram from which quantiles are computed. Large values are good, but use up more system memory. Decrease this if you see memory errors.
-b or --bufsize: The number of buffer pixels to render around the edges of the continents. Set to zero unless the raster-thin option is greater than 1. The buffer will not be very good. In general, if you want a buffer you're better off making your own in ArcView rather than using this option.
-q or --quantiles: A string containing the quantiles you want. For example, '0.25 0.5 0.75' would map the lower and upper quartiles and the median. Default is '0.05 0.25 0.5 0.75 0.95'.
-t or --thin: The factor by which to thin the MCMC trace stored in the database. If you use -t 10, only every 10th stored MCMC iteration will be used to produce the maps. Small values are good but slow. 1 is best.
-r or --raster-thin: The 3d data cube takes up much more disk space and memory than the scalar maps. You might need to degrade the input raster to lower resolution. A value of 10 means that the 3d maps will have 1/10 the spatial resolution of the input raster.
-i or --iter: The total number of predictive samples to use in generating the maps. Large values are good but slow. Defaults to 20000.
-p or --raster-path: The path to the files containing the covariate rasters. These files' headers must match those of the input raster, and their missing pixels must match those of the input raster also. There must be a file corresponding to every covariate column in input 3 of mbg-infer. For example, if you used rain and ndvi as your column headers, files rain.asc and ndvi.flt and temp.hdf5 should be present in the raster path. Defaults to the current working directory.
-y or --year: If your model is spatiotemporal, you must provide the decimal year at which you want your map produced. For example, Jan 1 2008 would be -y 2008.
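The saving from --raster-thin can be illustrated by subsampling a grid. This sketch assumes that thinning keeps every r-th row and column; the package may degrade the raster in some other way, and thin_raster is a hypothetical name.

```python
def thin_raster(grid, factor):
    """Keep every `factor`-th row and column of a 2D grid (list of lists)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [row[::factor] for row in grid[::factor]]


# A stand-in raster with 50 rows and 100 columns.
grid = [[r * 100 + c for c in range(100)] for r in range(50)]
thinned = thin_raster(grid, 10)
print(len(thinned), len(thinned[0]))  # 5 10
```

Thinning by 10 along each axis cuts the number of pixels, and so the size of the data cube, by a factor of about 100.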
mbg-validate module database-file burn pred-pts [options]
mbg-validate produces a folder called name-validation, name being the name of the database file. It populates this folder with two csv files called p-samps and n-samps containing posterior predictive samples of the probability of positivity and the number of individuals positive at each prediction location.
It also writes three of the four MBG world validation panels into the folder as PDFs.
- The name of the module containing the model specification.
- The name of the database file (produced by mbg-infer) to be used for validation. If it is not in the current directory, specify a path.
- The number of burnin iterations to discard from the trace before generating the predictive samples. You will need to figure this out by inspecting the traces produced by mbg-infer.
- A csv file containing the 'holdout' dataset. It should be in exactly the same format as the third required input to mbg-infer.
-t or --thin: The factor by which to thin the MCMC trace stored in the database. Small values are good but slow. 1 is best.
-i or --iter: The total number of predictive samples you want to generate. Large values are good but slow. Defaults to 20000.
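The relationship between the two kinds of sample can be sketched generically: given posterior samples of the probability of positivity at a location and the number of individuals examined there, each predictive count is a binomial draw. This is a textbook construction offered under that assumption, not a description of what mbg-validate does internally; draw_n_positive is a hypothetical name.

```python
import random


def draw_n_positive(p_samples, n_examined, seed=0):
    """For each sampled probability p, draw a Binomial(n_examined, p) count.

    The binomial draw is built as a sum of n_examined Bernoulli trials.
    """
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n_examined))
        for p in p_samples
    ]


# Made-up posterior samples of p at one holdout location, 100 examined.
p_samps = [0.10, 0.12, 0.15, 0.11]
n_samps = draw_n_positive(p_samps, n_examined=100)
print(n_samps)
```

Comparing the spread of such counts with the number actually observed positive in the holdout data is one way to read the predictive samples.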
mbg-scalar-priors module [options]
- The name of the module containing the model specification.
-i or --iter: The total number of predictive samples you want to generate. Large values are good but slow. Defaults to 20000.
mbg-realize-prior module ascii0.asc ascii1.asc ... [options]
mbg-realize-prior produces a number of prior realizations of the target surface (eg parasite rate, gene frequency, etc.) on several different asciis. Joint or 'conditional' simulations of surfaces are very expensive, so you can only afford to evaluate them on a few thousand pixels.
The multiple asciis are meant to be at multiple resolutions: you can make a coarse one over your entire area of interest, a medium-resolution one on a zoomed-in subset, and a few fine ones over small areas scattered around. That way you can see the large- and small-scale properties of the surface allowed by your prior without having to render the entire surface at full resolution.
Outputs a number of surfaces, evaluated onto the masks indicated by the input asciis. Each set of realizations is coherent across the input asciis; that is, the 'same' surface is evaluated on each ascii. That means you can meaningfully overlay the output asciis at different resolutions.
NOTE: All the parameters of the model will be drawn from the prior before generating each realization. If you want to fix a variable, you must set its observed flag.
- The name of the module containing the model specification.
- Several ascii files. Realizations will be evaluated on the union of the unmasked regions of these files.
-n or --n-realizations: The number of realizations to generate. Defaults to 5.
-m or --mean: The value of the global mean to use. Defaults to 0.
-y or --year: If your model is spatiotemporal, you must provide the decimal year at which you want your realizations produced. For example, Jan 1 2008 would be -y 2008.
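For dates other than January 1, a decimal year has to be computed. A minimal sketch follows, assuming the convention that the fractional part is the fraction of the calendar year elapsed; the text above only confirms that Jan 1 2008 corresponds to 2008, and decimal_year is a hypothetical name.

```python
from datetime import date


def decimal_year(d):
    """Convert a date to a decimal year (fraction of the year elapsed)."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days


print(decimal_year(date(2008, 1, 1)))  # 2008.0
```

The same conversion would apply to the t column of the input csv and to the -y option of the mapping commands.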