General utilities for survey analysis, such as filling missing values with the median and unfolding respondent-per-row tables into question-per-row tables.

paepcke/survey_utils

General Utilities for Survey Analysis

The survey_tools package includes facilities needed for analyzing survey result data. Facilities include table reshaping, missing-value replacement, selection of respondents who answered fewer than X% of the questions, as well as some plotting facilities.

Unfolding Tables

Survey results often arrive with each row holding data from one respondent. Many statistical analyses instead need one row per question, with each respondent's answer to that question in a separate column.

The unfold() method of TableShaper provides this reshaping. Input can be either a .csv file or a 2D Python array. Output may be:

  • directed to a new .csv file
  • written to stdout (the default)
  • retrieved row by row from an iterator, via next()

The unfold service may be invoked from Python code, or from the command line.

In reshaping, some columns may be retained, others discarded. Consider the following example:

userId question questionType timeAdded answer
10 DOB pullDown Jun2010 1983
10 gender radio May2011 F
20 DOB pullDown Jun2010 1980
20 gender radio May2011 M
                  ...

Minimally we want this table to be:

question v1 v2
DOB 1983 1980
gender F M

In this most minimal (but often sufficient) result, questionType and timeAdded are dropped. Values of the question column are distributed across new columns. Each vn column holds answers by one respondent to all questions. Rows now hold information about one question, no longer about a respondent.

Terminology:

  • The unfold column is the column whose values will turn into columns.
  • Unfold values are the values that are initially in the unfold column, and which will make up the new columns.
  • Constant columns are columns that will remain columns in the final outcome.
  • Column-name provider is a column whose values will be used as the names of the new columns. Often a respondent ID will be appropriate for this role. If no such column is provided, unfold() creates names.
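To make the terminology concrete, here is a minimal pure-Python sketch of the reshaping idea. It is not the package's implementation: `unfold_sketch` and its parameters are hypothetical names, and the sketch assumes that the n-th occurrence of each question in the input belongs to the n-th respondent, so that it can fill generated columns v1, v2, and so on.

```python
def unfold_sketch(rows, unfold_col, values_col):
    """Hypothetical helper: turn (respondent, question) rows into one row per question.

    rows is a list of dicts keyed by column name. Constant columns and
    column-name providers are left out to keep the sketch short.
    """
    answers = {}  # dicts keep insertion order, so questions stay in input order
    for row in rows:
        answers.setdefault(row[unfold_col], []).append(row[values_col])

    width = max((len(vals) for vals in answers.values()), default=0)
    header = [unfold_col] + [f"v{i}" for i in range(1, width + 1)]
    table = [header]
    for question, vals in answers.items():
        # Pad with None so that every output row has the same number of columns.
        table.append([question] + vals + [None] * (width - len(vals)))
    return table


rows = [
    {"userId": 10, "question": "DOB", "answer": "1983"},
    {"userId": 10, "question": "gender", "answer": "F"},
    {"userId": 20, "question": "DOB", "answer": "1980"},
    {"userId": 20, "question": "gender", "answer": "M"},
]
table = unfold_sketch(rows, unfold_col="question", values_col="answer")
for line in table:
    print(line)
```

Here question is the unfold column and answer supplies the unfold values; the printed rows match the minimal table shown earlier.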

Let the unfold column be question, and the constant columns be questionType and timeAdded. You could call the method like this:

shaper = TableShaper()
shaper.unfold('/tmp/in.csv',
              col_name_to_unfold='question',
              col_name_unfold_values='answer',
              constant_cols=['questionType', 'timeAdded'])

The reshaped table looks like this:

question questionType timeAdded v1 v2
DOB pullDown Jun2010 1983 1980
gender radio May2011 F M

Note that in this example the constant columns are questionType and timeAdded, and they are retained. It is an error to have inconsistencies in the constant columns. For instance, if the original row

"20 DOB pullDown..."

had been

"20 DOB radio"

an error would have been raised. For any given question, the values in each constant column must match across all rows of the original table.
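The consistency rule can be pictured with a short sketch. The function name `check_constant_cols` and the choice of ValueError are assumptions made for illustration; the source does not say which exception unfold() actually raises.

```python
def check_constant_cols(rows, unfold_col, constant_cols):
    """Hypothetical check: each question must always carry the same constant values."""
    seen = {}
    for row in rows:
        key = row[unfold_col]
        current = tuple(row[col] for col in constant_cols)
        previous = seen.setdefault(key, current)
        if previous != current:
            # ValueError is an assumption; the real package may raise something else.
            raise ValueError(
                f"Inconsistent constant columns for {key!r}: {previous} vs. {current}"
            )
    return seen


good_rows = [
    {"userId": 10, "question": "DOB", "questionType": "pullDown", "timeAdded": "Jun2010"},
    {"userId": 20, "question": "DOB", "questionType": "pullDown", "timeAdded": "Jun2010"},
]
# Same data, but respondent 20's DOB row claims a different question type.
bad_rows = [good_rows[0], dict(good_rows[1], questionType="radio")]

constants = check_constant_cols(good_rows, "question", ["questionType", "timeAdded"])

try:
    check_constant_cols(bad_rows, "question", ["questionType", "timeAdded"])
    conflict_detected = False
except ValueError:
    conflict_detected = True
```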

Another way to call the function controls the names of the new columns. One column can be specified to provide the column headers:

shaper.unfold('/tmp/in.csv',
              col_name_to_unfold='question',
              col_name_unfold_values='answer',
              constant_cols=['questionType', 'timeAdded'],
              new_col_names_col='userId')

The reshaped table would look like this:

question questionType timeAdded 10 20
DOB pullDown Jun2010 1983 1980
gender radio May2011 F M

That is, the userId values are used as the column headers of the new table.
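The difference between generated and provided column names can be sketched as follows. `column_headers` is a hypothetical helper, not part of the package, and it assumes that provider values should appear in order of first occurrence.

```python
from collections import Counter


def column_headers(rows, unfold_col, name_col=None):
    """Hypothetical helper: choose the headers of the new columns."""
    if name_col is not None:
        # dict.fromkeys drops duplicates while keeping first-appearance order.
        return [str(value) for value in dict.fromkeys(row[name_col] for row in rows)]
    # Without a provider, generate v1..vn, where n is the largest number
    # of answers recorded for any single question.
    counts = Counter(row[unfold_col] for row in rows)
    width = max(counts.values(), default=0)
    return [f"v{i}" for i in range(1, width + 1)]


rows = [
    {"userId": 10, "question": "DOB", "answer": "1983"},
    {"userId": 10, "question": "gender", "answer": "F"},
    {"userId": 20, "question": "DOB", "answer": "1980"},
    {"userId": 20, "question": "gender", "answer": "M"},
]
named = column_headers(rows, "question", name_col="userId")
generated = column_headers(rows, "question")
print(named, generated)
```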

To have the function behave like an iterator (each item will be an array with one row of the reshaped table):

it = shaper.unfold('/tmp/in.csv',
                   col_name_to_unfold='question',
                   col_name_unfold_values='answer',
                   constant_cols=['questionType', 'timeAdded'],
                   out_method=OutMethod.ITERATOR)
for row in it:
    print(row)

To write the output to a file:

shaper.unfold('/tmp/in.csv',
              col_name_to_unfold='question',
              col_name_unfold_values='answer',
              constant_cols=['questionType', 'timeAdded'],
              new_col_names_col='userId',
              out_method=OutMethod('/tmp/trash.csv'))
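The output styles described above (iterator, file, stdout) can be pictured with the standard library alone. This sketch does not use OutMethod; `write_csv` is a hypothetical helper, and the in-memory buffer stands in for a real file or sys.stdout.

```python
import csv
import io


def write_csv(rows, stream):
    """Hypothetical helper: write rows as CSV to any text stream."""
    writer = csv.writer(stream, lineterminator="\n")
    for row in rows:
        writer.writerow(row)


table = [
    ["question", "v1", "v2"],
    ["DOB", "1983", "1980"],
    ["gender", "F", "M"],
]

# Iterator style: pull one reshaped row at a time.
it = iter(table)
first = next(it)

# File or stdout style: any text stream works, e.g. open(path, "w", newline="")
# or sys.stdout. A StringIO buffer keeps this example free of side effects.
buffer = io.StringIO()
write_csv(table, buffer)
print(buffer.getvalue(), end="")
```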

Finally, to use the unfold facility from the command line:

prompt> python src/survey_utils/unfolding.py --h
usage: unfolding.py [-h] [-c CONSTANTCOL] [-n NEWCOLNAMECOL]
                    table_path col_to_unfold col_of_values

positional arguments:
  table_path            Path to .csv file
  col_to_unfold         Name of column whose values are to be new columns
  col_of_values         Name of column whose values will be the values in the new columns.

optional arguments:
  -h, --help            show this help message and exit
  -c CONSTANTCOL, --constantCol CONSTANTCOL
                        Column(s) to keep; all others except the unfold
                        column will be discarded. Use as often as needed.
  -n NEWCOLNAMECOL, --newColNameCol NEWCOLNAMECOL
                        Column that will supply names for new columns 
                        (e.g. 'userId'); if not provided, the new cols 
                        will be 'v1','v2',...
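For readers who want to see how such an interface is typically declared, the following argparse sketch reproduces the usage message above. It is a reconstruction from the help text only; in particular, using action="append" for the repeatable -c option is an assumption, and the real unfolding.py may be written differently.

```python
import argparse


def build_parser():
    """Sketch of a parser that matches the documented usage message."""
    parser = argparse.ArgumentParser(prog="unfolding.py")
    parser.add_argument("table_path", help="Path to .csv file")
    parser.add_argument("col_to_unfold",
                        help="Name of column whose values are to be new columns")
    parser.add_argument("col_of_values",
                        help="Name of column whose values will be the values in the new columns")
    # Assumption: "Use as often as needed" is implemented by appending each -c value.
    parser.add_argument("-c", "--constantCol", action="append", default=[],
                        help="Column(s) to keep; use as often as needed")
    parser.add_argument("-n", "--newColNameCol", default=None,
                        help="Column that will supply names for new columns")
    return parser


args = build_parser().parse_args(
    ["/tmp/in.csv", "question", "answer",
     "-c", "questionType", "-c", "timeAdded",
     "-n", "userId"]
)
print(args.constantCol, args.newColNameCol)
```

The argument list passed to parse_args() corresponds to the second Python example above, with userId supplying the new column names.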

Replacing Missing Values

Given either a numpy ndarray or a Pandas DataFrame, you can replace missing values with the:

  • mean of the value's row,
  • mean of the value's column,
  • median of the value's row, or
  • median of the value's column.

In addition, you can specify what is considered a missing value. Options are:

  • numpy.nan
  • numpy.inf
  • numpy.posinf
  • numpy.neginf
  • any other Python value.
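One detail worth keeping in mind when a missing-value marker is configurable: NaN never compares equal to itself, so a plain equality test cannot find it. The sketch below shows one way to handle this with the standard library; `is_missing` is a hypothetical helper and says nothing about how the package itself performs the comparison.

```python
import math


def is_missing(value, missing_value):
    """Hypothetical test for whether value matches the chosen missing-value marker."""
    if isinstance(missing_value, float) and math.isnan(missing_value):
        # NaN != NaN, so it needs an explicit isnan() check.
        return isinstance(value, float) and math.isnan(value)
    # Infinities and ordinary Python values compare normally.
    return value == missing_value


nan = float("nan")
checks = {
    "nan matches nan": is_missing(nan, nan),
    "zero matches zero": is_missing(0, 0),
    "inf matches inf": is_missing(float("inf"), float("inf")),
    "-inf matches inf": is_missing(float("-inf"), float("inf")),
    "3 matches nan": is_missing(3, nan),
}
for label, result in checks.items():
    print(label, result)
```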

Example: given numpy.ndarray self.arr:

A B C D
1 2 3 13
4 0 6 14
4 8 9 15
10 11 12 16

You can call:

res = replaceMissingValsNparray(self.arr, 
                                direction='column',
                                replacement='median',
                                missing_value=0)

to get:

A B C D
1 2 3 13
4 8 6 14
4 8 9 15
10 11 12 16

Notice that the zero in arr[1,1] was replaced by the median of the column in which it resided: median(2, 8, 11) = 8. The zero itself is disregarded in the median computation.

Instead of setting replacement to 'median', it can be set to 'mean', which yields mean(2, 8, 11) = 7:

A B C D
1 2 3 13
4 7 6 14
4 8 9 15
10 11 12 16

The direction parameter can be set to 'row', in which case the mean or median is taken across the row instead of down the column.
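The arithmetic behind these options can be checked with a small sketch that uses plain lists and the standard statistics module rather than numpy. `replace_missing` is a hypothetical stand-in written for illustration only; it is not replaceMissingValsNparray(), although its parameters are named after the ones shown above.

```python
from statistics import mean, median


def replace_missing(table, direction="column", replacement="median", missing_value=0):
    """Hypothetical sketch: fill missing cells from their row or column peers."""
    stat = {"mean": mean, "median": median}[replacement]
    result = [list(row) for row in table]  # leave the input untouched
    for r, row in enumerate(table):
        for c, value in enumerate(row):
            if value != missing_value:
                continue
            if direction == "column":
                peers = [other[c] for other in table]
            else:
                peers = row
            # Missing cells are excluded from the statistic itself.
            valid = [v for v in peers if v != missing_value]
            result[r][c] = stat(valid)
    return result


arr = [
    [1, 2, 3, 13],
    [4, 0, 6, 14],
    [4, 8, 9, 15],
    [10, 11, 12, 16],
]
by_col_median = replace_missing(arr, "column", "median")  # cell becomes 8
by_col_mean = replace_missing(arr, "column", "mean")      # cell becomes 7
by_row_median = replace_missing(arr, "row", "median")     # median(4, 6, 14) = 6
by_row_mean = replace_missing(arr, "row", "mean")         # mean(4, 6, 14) = 8
print(by_col_median[1][1], by_col_mean[1][1], by_row_median[1][1], by_row_mean[1][1])
```

If a row or column consists only of missing values, the statistics functions raise StatisticsError in this sketch; a real implementation would need to decide how to treat that case.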

In addition to replaceMissingValsNparray(), which works on numpy.ndarray structures, a corresponding replaceMissingValsDataFrame() function works on Pandas DataFrames.

Dendrograms

Function fancy_dendrogram() displays hierarchical clusters in visual form. The function is from a dendrogram tutorial with some additional documentation in the code header.

Here is how to use the facility.

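import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import linkage
# fancy_dendrogram is provided by this package; import it from wherever
# it lives in your installation.

def test_fancy_dendrogram(self):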
    '''
    Generates a dendrogram in a new window.
    '''
    # generate two clusters: a with 100 points, b with 50:
    np.random.seed(4711)  # for repeatability of this tutorial
    a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
    b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
    X = np.concatenate((a, b),)

    # generate the linkage matrix
    Z = linkage(X, 'ward')

    fancy_dendrogram(
        Z,
        truncate_mode='lastp',
        p=12,
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,
        annotate_above=10  # useful in small plots so annotations don't overlap
        )

    plt.show()

Example dendrogram

Installation

You can install via pip, or by cloning the GitHub repository. Using pip:

pip install survey_tools

is simple, but installs all subpackages. Some of them depend on large packages such as scipy; others have far fewer dependencies. Alternatively, clone the GitHub repo:

git clone git@github.com:paepcke/survey_utils.git

You can then install individual subpackages:

python setup.py install table_utils
python setup.py install math_utils
python setup.py install plotting_utils

Or, to install all:

python setup.py install

Testing:

python setup.py test table_utils
python setup.py test math_utils
python setup.py test plotting_utils

Or, to test all:

python setup.py test

Note: The options above are listed in order of installation volume. Because scipy, numpy, and matplotlib are not handled well by pip, the installation assumes that you install them separately ahead of time if they are needed. The easiest route is an Anaconda virtual environment, which knows about these modules natively; instructions are available on the Web.
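Since these three packages must be present before installation, a quick check from Python can save a failed install. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper and is not shipped with this repository.

```python
import importlib.util


def missing_packages(names):
    """Hypothetical helper: return the top-level packages that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


needed = ["numpy", "scipy", "matplotlib"]
absent = missing_packages(needed)
if absent:
    print("Install these first:", ", ".join(absent))
else:
    print("All prerequisite packages are available.")
```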
