
nicholasareynolds/gamut


gamut

gamut is a GUI tool that helps users fit a distribution to their univariate data sets, leveraging the full gamut of continuous distributions in SciPy's statistics toolbox. gamut accepts a user-supplied data set of samples (optionally removing outliers). The user specifies continuous distributions, for which probability plot regressions are performed to identify candidate distributions with which to model the random variable. gamut then computes the values of the chosen distribution's parameters for the given data using a maximum likelihood estimate. Lastly, gamut provides the syntax for instantiating a SciPy frozen instance of the chosen distribution.

Introduction

Background

Probability plotting is a powerful method for quantifying goodness-of-fit, with several advantages over other goodness-of-fit tests. Unlike the Chi-Square goodness-of-fit test, probability plotting does not require that the samples be grouped into bins, whose size may impact the goodness-of-fit. Unlike the Kolmogorov-Smirnov goodness-of-fit test, wherein the values of a candidate distribution's parameters must be known a priori, probability plotting can use the samples themselves to estimate the values of those parameters. Lastly, unlike the Anderson-Darling goodness-of-fit test, probability plotting does not depend on pre-tabulated values specific to each distribution and significance level.

Instead, probability plotting relies on order statistics and a distribution's cumulative distribution function (CDF) to determine the goodness-of-fit. The ordered samples are plotted against the order statistic medians. In a linear regression of these data, the slope is an estimate of the scale factor of the distribution, the intercept is an estimate of the location factor of the distribution, and the coefficient of determination (R^2) is a measure of the goodness-of-fit.
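The regression described above can be sketched in a few lines of Python. This is an illustrative sketch, not gamut's own code: the helper names filliben_medians and probability_plot_fit are hypothetical, and the order statistic medians use Filliben's estimate, the same estimate used by scipy.stats.probplot.

```python
import numpy as np
from scipy import stats


def filliben_medians(n):
    """Filliben's estimate of the uniform order statistic medians."""
    m = (np.arange(1, n + 1) - 0.3175) / (n + 0.365)
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m


def probability_plot_fit(samples, dist=stats.norm, shape_args=()):
    """Regress ordered samples on the order statistic medians (hypothetical helper)."""
    ordered = np.sort(np.asarray(samples, dtype=float))
    # Map the uniform medians through the inverse CDF of the standard
    # (loc=0, scale=1) candidate distribution.
    medians = dist.ppf(filliben_medians(ordered.size), *shape_args)
    fit = stats.linregress(medians, ordered)
    # slope -> scale estimate, intercept -> location estimate, r^2 -> goodness-of-fit
    return fit.slope, fit.intercept, fit.rvalue ** 2


# Synthetic normal samples with known location (5.0) and scale (2.0).
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=500)
scale_est, loc_est, r_squared = probability_plot_fit(samples)
print(scale_est, loc_est, r_squared)
```

For samples that really do follow the candidate distribution, the slope and intercept should land close to the true scale and location, and R^2 should be close to 1.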

Getting Started

Prerequisites:

In order to use gamut, a Python 3 interpreter is needed. It can be downloaded directly from python.org or, more conveniently, as part of a bundled Python distribution (e.g. Enthought Canopy or Anaconda). Furthermore, several supporting libraries are required:

Setting up gamut

Clone or download the repository from the gamut GitHub page.

Add the destination directory path (destpath) to the PYTHONPATH:

For Unix Systems:

export PYTHONPATH=$PYTHONPATH:destpath

For Windows systems, add the directory to the PYTHONPATH using the set command.

set PYTHONPATH=%PYTHONPATH%;destpath

Usage/Execution

From the command line, enter the command

python gamut.py

Basic Workflow

A user provides a set of samples to gamut and selects which distributions he/she would like considered as candidate distributions for modeling the data. For each candidate, gamut performs a probability plot linear regression of the data, which yields a coefficient of determination (R^2) that can be used to identify distributions suitable for modeling the data set. Once an ideal distribution has been identified, the values of its parameters are computed for the given samples using a maximum likelihood estimate (MLE).

Importing Data Set

A user imports his/her data set by clicking on the Select File button in the gamut window and navigating to the file location. The data must be organized in a comma-separated values (.csv) format. Samples can be listed on one or more rows and in one or more columns in the file; gamut will flatten all values into a single array.
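As an illustration of the flattening behavior, the following sketch reads a CSV whose rows have different lengths and collects every value into a one-dimensional NumPy array. The function load_samples is hypothetical and is not taken from gamut's source; how gamut actually parses the file is an assumption here.

```python
import csv
import io

import numpy as np


def load_samples(csv_file):
    """Flatten every numeric cell of a CSV file object into a 1-D array (hypothetical helper)."""
    values = []
    for row in csv.reader(csv_file):
        for cell in row:
            cell = cell.strip()
            if cell:  # skip empty cells, e.g. from trailing commas
                values.append(float(cell))
    return np.array(values)


# An in-memory stand-in for a .csv file with rows of unequal length.
text = "1.0,2.5,3.0\n4.2,5.1\n6.0\n"
samples = load_samples(io.StringIO(text))
print(samples.shape)
```

With a file on disk, the same function can be called as load_samples(open(path, newline="")).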

Removal of Outliers

gamut optionally removes outliers using the generalized extreme Studentized deviate (ESD) test (an iterative version of Grubbs', or the maximum normed residual, test). In order to remove outliers, a user clicks on the Outliers Settings button and checks the Remove Outliers checkbox. He/she is then prompted to enter the significance level to be used in detecting and eliminating the outliers from the data set. Note that the generalized ESD test is a two-sided test that assumes the data can be approximated by the normal distribution.

Changing the outlier settings after candidate distributions have already been selected will rerun the probability plot and MLE fit operations for all existing candidate distributions.
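For readers unfamiliar with the generalized ESD test, a minimal sketch follows. It is not gamut's implementation: the function generalized_esd and its max_outliers argument (the upper bound on the number of outliers to test for) are assumptions made for illustration. The critical values follow the standard two-sided formulation based on the Student t distribution.

```python
import numpy as np
from scipy import stats


def generalized_esd(samples, max_outliers, alpha=0.05):
    """Two-sided generalized ESD test; returns (cleaned, outliers). Hypothetical helper."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    remaining = np.arange(n)  # indices still in the data set
    removed = []              # indices removed, in order of removal
    n_outliers = 0
    for i in range(1, max_outliers + 1):
        vals = x[remaining]
        dev = np.abs(vals - vals.mean())
        j = np.argmax(dev)
        r_i = dev[j] / vals.std(ddof=1)  # test statistic R_i
        removed.append(remaining[j])
        remaining = np.delete(remaining, j)
        # Critical value lambda_i for significance level alpha.
        p = 1.0 - alpha / (2.0 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t ** 2) * (n - i + 1))
        if r_i > lam:
            n_outliers = i  # number of outliers = largest i with R_i > lambda_i
    outlier_idx = np.array(removed[:n_outliers], dtype=int)
    mask = np.ones(n, dtype=bool)
    mask[outlier_idx] = False
    return x[mask], x[outlier_idx]


# Deterministic, roughly normal data plus two gross outliers.
base = stats.norm.ppf(np.linspace(0.02, 0.98, 50))
data = np.concatenate([base, [15.0, -12.0]])
cleaned, outliers = generalized_esd(data, max_outliers=5, alpha=0.05)
print(outliers)
```

The test removes the most extreme point at each step but only declares outliers up to the last step at which the statistic exceeds its critical value, which guards against masking when several outliers are present.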

Identifying Distributions that Follow the Data Set (Probability Plotting)

A user selects distributions in the SciPy Distributions portion of the window to be candidate distributions. gamut supports all the continuous distributions in SciPy. Every distribution has an associated scale factor and location factor; however, the number of shape factors varies from distribution to distribution. gamut will enable shape factor entry boxes for each of the shape factors of the highlighted distribution; the user is responsible for filling in these values. This allows a user to consider different shape factors for the same distribution as different candidate distributions.

Note: for distributions with only one shape factor (e.g. lognorm or frechet_r (i.e. weibull)), users have the option of specifying the bounds of the shape factor and clicking on the Calculate PPCC button. The optimal shape factor will then be calculated using the scipy.stats.ppcc_max function, which finds the shape factor that maximizes the probability plot correlation coefficient for the given data set.
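The call behind this option might look like the sketch below. It assumes a recent SciPy release, in which the Weibull distribution is available as weibull_min; the synthetic data and the bracket values are chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic Weibull samples with a known shape factor of 2.0 and scale of 3.0.
rng = np.random.default_rng(1)
samples = 3.0 * rng.weibull(2.0, size=1000)

# brack gives the starting bracket for the shape factor search.
best_shape = stats.ppcc_max(samples, brack=(0.5, 5.0), dist="weibull_min")
print(best_shape)
```

The returned value is the shape factor for which the probability plot of the samples is most nearly linear, so it should fall near the shape used to generate the data.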

A user adds a selected candidate distribution to the list of considered distributions either by double-clicking the distribution (if all the shape factors are entered) or by clicking the Add Distribution button. This will add a row in the Probability Plotting section of the window. Similarly, one or all of the distributions can be removed from this section by clicking on an entry in the Probability Plotting section and clicking Remove, or by clicking Remove All, respectively.

In the Probability Plotting section of the window, the values computed from the probability plot linear regression are displayed. gamut employs the probability plotting function scipy.stats.probplot to perform the linear regression. This function employs Filliben's estimate [1] of the order statistic medians (i.e. quantiles). The values computed from this function include the R^2 value, the scale factor, and the location factor. The shape factors correspond to values entered by the user. The closer the R^2 value is to 1.00, the more closely that distribution follows the data set. To see the probability plot (ordered samples vs. order statistic medians) of a distribution, double-click on that distribution's entry in the table. The user can optionally save this probability plot as a portable network graphics (PNG) image.
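A sketch of how several candidates could be compared with scipy.stats.probplot is shown here. The candidate list, the shape factor 0.5, and the synthetic data are assumptions made for illustration, not values taken from gamut.

```python
import numpy as np
from scipy import stats

# Synthetic lognormal samples (shape factor 0.5).
rng = np.random.default_rng(2)
samples = rng.lognormal(mean=0.0, sigma=0.5, size=400)

# Candidate distributions mapped to their user-supplied shape factors.
candidates = {"norm": (), "uniform": (), "lognorm": (0.5,)}

r_squared = {}
for name, shape_args in candidates.items():
    # probplot returns the plotted points and the regression results.
    (osm, osr), (scale, loc, r) = stats.probplot(samples, sparams=shape_args, dist=name)
    r_squared[name] = r ** 2
    print(f"{name}: R^2={r ** 2:.4f}, scale={scale:.4f}, loc={loc:.4f}")

best = max(r_squared, key=r_squared.get)
print(best)
```

Here osm holds the order statistic medians and osr the ordered samples, which are the two axes of the probability plot; the slope and intercept of the regression are the scale and location estimates.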

Fitting the Data (Maximum Likelihood Estimate)

Lastly, the shape, location, and scale parameters are calculated using a maximum likelihood estimate (MLE), as implemented by the fit method of the continuous distributions in SciPy. These values are used to display the probability density function (PDF) and cumulative distribution function (CDF) when clicking the PDF/CDF button, and in the syntax to instantiate a frozen distribution in SciPy, shown by clicking on the SciPy Call button.
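The fit-then-freeze step can be sketched as follows. The lognorm distribution and the synthetic data are chosen for illustration, and the exact text that gamut produces for the SciPy Call button is not shown in this document, so the call_syntax string below is a hypothetical format.

```python
import numpy as np
from scipy import stats

# Synthetic lognormal samples standing in for a user's data set.
rng = np.random.default_rng(3)
samples = rng.lognormal(mean=1.0, sigma=0.4, size=2000)

# MLE fit: returns the shape factor(s) first, then location and scale.
shape, loc, scale = stats.lognorm.fit(samples)

# A frozen distribution has its parameters fixed at the fitted values.
frozen = stats.lognorm(shape, loc=loc, scale=scale)

# Evaluate the PDF and CDF on a grid spanning the data.
x = np.linspace(samples.min(), samples.max(), 5)
pdf_values = frozen.pdf(x)
cdf_values = frozen.cdf(x)

# Hypothetical format for the instantiation syntax.
call_syntax = f"scipy.stats.lognorm({shape:.4f}, loc={loc:.4f}, scale={scale:.4f})"
print(call_syntax)
```

Once frozen, the distribution object exposes methods such as pdf, cdf, ppf, and rvs without the parameters having to be passed again.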

Administrative

License

gamut is licensed under the MIT License - see the LICENSE file.

Citing/acknowledgement

As a courtesy, please acknowledge gamut in any papers, reports, or publications for which gamut was used.

Contact

gamut is by no means a completed project, or limited to contributions by the author. If you wish to suggest changes, provide feedback, or even collaborate, please contact me at (nicholas.a.reynolds@gmail.com).

Acknowledgements

gamut has not implemented any new scientific concepts; it has merely implemented methodologies that I learned in grad school and that are openly available. In becoming more acquainted with Python in general and SciPy in particular, I organized these tools in what I consider to be a convenient workflow for scientists/engineers who are not regularly involved in uncertainty quantification.

That said, NIST's Engineering Statistics Handbook was an excellent resource in preparing gamut. The infrastructure laid out by Travis Oliphant and the SciPy Developers in SciPy's statistics toolbox made developing this tool a very straightforward endeavor.

References

  • [1] Filliben, J. J. (February 1975), The Probability Plot Correlation Coefficient Test for Normality, Technometrics, pp. 111-117.
