Randan

A Python package for the analysis of social data

Current version: 0.2.1

Documentation: https://randan.readthedocs.io/en/latest/

If you want to contribute or report a bug, do not hesitate to open an issue on this page or contact us: alexey.n.rotmistrov@gmail.com (Aleksei Rotmistrov), lana_lob@mail.ru (Svetlana Zhuchkova).

Overview

Randan is a Python package that aims to provide most of the functions available in SPSS. Unlike other Python packages for data analysis, it has three main features that make it attractive for social scientists:

it provides the results of the analysis in a readable and understandable form, similar to SPSS
it gives you information about statistical significance of the parameters whenever possible
it unites all the necessary methods so you do not need to switch between different packages and software anymore

Because we care about how your results look, we strongly recommend using randan in Jupyter Notebook and storing your data in pandas DataFrames.

N.B.: This project is under active development, so it changes frequently. However, you can use all the modules and classes included in the latest release.

Installation

You can install the package from PyPI by running:

pip install randan

If something goes wrong during the installation, try installing the package for the current user only:

pip install --user randan

To upgrade the package to the latest version, run:

pip install --upgrade randan
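
To confirm which version is actually installed (for example, to compare it with the current version stated at the top of this page), one option is the standard library's importlib.metadata. The sketch below is not part of randan; the helper name installed_version is hypothetical, and it simply returns None when the package is missing.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "randan" is the distribution name used in the pip commands above
print(installed_version("randan"))
```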

Once you have installed the package, you can import it like any other Python package:

# like this
import randan

# or like this
from randan.tree import CHAIDRegressor

# etc.

Structure

So far, seven modules have been included in the package. Their classes correspond to SPSS functions as follows:

Module	Class or function	Corresponding SPSS function	Description
descriptive_statistics	NominalStatistics	Analyze -> Descriptive statistics -> Frequencies, Descriptives, Explore	Descriptive statistics relevant for nominal variables
descriptive_statistics	OrdinalStatistics	Analyze -> Descriptive statistics -> Frequencies, Descriptives, Explore	Descriptive statistics relevant for ordinal variables
descriptive_statistics	ScaleStatistics	Analyze -> Descriptive statistics -> Frequencies, Descriptives, Explore	Descriptive statistics relevant for scale (interval) variables
bivariate_association	Crosstab	Analyze -> Descriptive statistics -> Crosstabs	Analysis of contingency tables
bivariate_association	Correlation	Analyze -> Correlate -> Bivariate	Correlation coefficients
comparison_of_central_tendency	ANOVA	Analyze -> Compare means -> One-Way ANOVA	Analysis of variance
clustering	KMeans	Analyze -> Classify -> K-Means Cluster	Cluster analysis with k-means algorithm
dimension_reduction	CA	Analyze -> Dimension Reduction -> Correspondence Analysis	Correspondence analysis
dimension_reduction	PCA	Analyze -> Dimension Reduction -> Factor (extraction method: principal components)	Principal component analysis
regression	LinearRegression	Analyze -> Regression -> Linear	OLS regression
regression	BinaryLogisticRegression	Analyze -> Regression -> Binary Logistic	Binary logistic regression
tree	CHAIDRegressor, CHAIDClassifier	Analyze -> Classify -> Tree -> CHAID	CHAID decision tree for scale and categorical dependent variables, respectively

Quick start

Although randan is built to resemble SPSS, it follows the fit-predict and fit-transform approach used by the most popular Python machine learning packages. With this approach, you first initialize your model and then, if necessary, fit it to your data (i.e., call the fit function).

If the method is unsupervised (i.e., there is no dependent variable in your data), you can then call the transform function to get the values of the latent dependent variable that the method produces, such as cluster membership or factor scores.

If the method is supervised (i.e., there is a dependent variable in your data), you can then call the predict function to get predicted values of that dependent variable.

If the method does not estimate new values for your data (e.g., crosstabs or t-tests), you do not need the fit and transform / predict functions.
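
The fit-then-transform pattern described above can be illustrated without randan at all. The toy class below is a hypothetical stand-in, not a randan class: fit learns something from the data and returns the object itself, which is what makes one-line chaining such as Model().fit(data) possible, and transform then produces a new value for each observation. As in the calls shown in this README, transform can be called without passing the data again.

```python
from statistics import mean, pstdev


class ToyStandardizer:
    """Hypothetical example class, not part of randan."""

    def fit(self, values):
        # learn parameters from the data and keep them on the object
        self._values = list(values)
        self.mean_ = mean(self._values)
        self.std_ = pstdev(self._values)
        return self  # returning self allows ToyStandardizer().fit(...)

    def transform(self, values=None):
        # with no argument, reuse the data that was seen in fit
        data = self._values if values is None else values
        return [(v - self.mean_) / self.std_ for v in data]


scaler = ToyStandardizer().fit([2, 4, 4, 4, 5, 5, 7, 9])
z_scores = scaler.transform()
```

A supervised model would follow the same shape, with predict in place of transform.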

If you want to see the full list of functions available for a class, please visit our documentation page or simply ask for help:

from randan.bivariate_association import Crosstab
help(Crosstab)

Module `bivariate_association`

This module collects methods for finding statistical relationships between two variables. These methods do not require calling the fit function; you only need to instantiate the appropriate class:

from randan.bivariate_association import Crosstab

# with this code, you will immediately see the results
ctab = Crosstab(data, row='genre', column='age_ord')

# individual statistics are also available as attributes of the object
print(ctab.chi_square, ctab.pvalue, ctab.n_cells)
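
For readers who want to see what a chi-square statistic for a crosstab measures, here is a minimal pure-Python sketch of the textbook Pearson formula: the sum over all cells of (observed - expected)^2 / expected. It is an illustration only and is not taken from randan's source; the function name and the toy data are made up.

```python
from collections import Counter


def chi_square_statistic(row_values, column_values):
    """Pearson chi-square and degrees of freedom for two categorical lists."""
    n = len(row_values)
    observed = Counter(zip(row_values, column_values))  # cell counts
    row_totals = Counter(row_values)
    column_totals = Counter(column_values)

    chi2 = 0.0
    for r in row_totals:
        for c in column_totals:
            # expected count under independence of rows and columns
            expected = row_totals[r] * column_totals[c] / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(column_totals) - 1)
    return chi2, dof


# hypothetical data: 60 films, two genres, two age ratings
genre = ["drama"] * 30 + ["comedy"] * 30
age = ["12+"] * 10 + ["16+"] * 20 + ["12+"] * 20 + ["16+"] * 10

chi2, dof = chi_square_statistic(genre, age)
```

Here every expected cell count is 15, so the statistic is 4 * 25 / 15, about 6.67, with 1 degree of freedom.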

Module `comparison_of_central_tendency`

This module contains both parametric and non-parametric methods for comparing measures of central tendency. These methods do not require calling the fit function; you only need to instantiate the appropriate class:

from randan.comparison_of_central_tendency import ANOVA

# with this code, you will immediately see the results
anv = ANOVA(data, dependent_variables='kinopoisk_rate', independent_variable='genre')

# individual statistics are also available as attributes of the object
print(anv.F, anv.pvalue, anv.SSt)
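
The F statistic and the sums of squares in a one-way ANOVA come from a short decomposition: the total sum of squares splits into a between-group part and a within-group part. The sketch below works through the standard textbook formulas in plain Python. It is not randan's implementation, and the function name, dictionary keys, and data are hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """groups: dict mapping a group label to a list of numeric values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    # variation of group means around the grand mean
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
    )
    # variation of observations around their own group mean
    ss_within = sum(
        (v - mean(values)) ** 2 for values in groups.values() for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "F": f_stat,
        "SSb": ss_between,
        "SSw": ss_within,
        "SSt": ss_between + ss_within,
    }


ratings = {"drama": [1, 2, 3], "comedy": [2, 3, 4], "thriller": [3, 4, 5]}
result = one_way_anova(ratings)
```

For this toy data the between-group and within-group sums of squares are both 6, so F = (6 / 2) / (6 / 6) = 3.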

Module `clustering`

This module includes two main clustering methods: k-means and hierarchical (agglomerative) clustering.

Clustering methods belong to unsupervised learning, which means you should use the fit function after calling the appropriate class, and then, if necessary, the transform function to acquire cluster membership (and / or distances to each center in case of k-means).

from randan.clustering import KMeans

# with this code, you will immediately see the results, including visualization of clusters
km = KMeans(2).fit(data, ['year', 'time', 'kinopoisk_rate_count'])

# this is how you can predict the cluster membership, 
# and the distances from each observation to each cluster's center
clusters = km.transform(distance_to_centers=True)

If you have trouble with visualization and see captions like <Figure size 800x500 with 1 Axes> instead of plots, just re-run the code that produces them.
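
To make the idea of cluster membership and distances to each center concrete, the sketch below assigns points to the nearest of some given centers using plain Python. It only illustrates the assignment step of k-means with hypothetical names and data; it does not reproduce randan's KMeans class or the format of its output.

```python
from math import dist


def assign_to_centers(points, centers):
    """Return the nearest-center index and all distances for each point."""
    # Euclidean distance from every point to every center
    distances = [[dist(point, center) for center in centers] for point in points]
    # membership is the index of the closest center
    labels = [row.index(min(row)) for row in distances]
    return labels, distances


centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 8.0), (4.0, 3.0)]

labels, distances = assign_to_centers(points, centers)
```

A full k-means run would alternate this step with recomputing each center as the mean of its members until the assignments stop changing.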

Module `dimension_reduction`

This module unites methods for factorization of nominal and scale variables: correspondence analysis (class CA) and principal component analysis (class PCA).

Factorization methods belong to unsupervised learning, which means you should use the fit function after calling the appropriate class, and then, if necessary, the transform function to acquire so-called factor scores.

from randan.dimension_reduction import PCA 
 
vars_ = ['trstprl', 'trstlgl', 'trstplc', 'ppltrst', 'pplfair', 'pplhlp']

# with this code, you will immediately see the results
pca = PCA(n_components=2, rotation='varimax').fit(data, variables=vars_)

# this is how you can predict the factor scores
f_scores = pca.transform()

Module `regression`

This module consists of two classical regression models: linear regression and binary logistic regression. This group of methods belongs to supervised learning, which means you should use the fit function after calling the appropriate class, and then, if necessary, the predict function to acquire predictions.

from randan.regression import LinearRegression

# with this code, you will immediately see the results
formula = 'kinopoisk_rate = time + year + genre + genre*type'

regr = LinearRegression().fit(
    data, 
    formula=formula,
    categorical_variables=['genre', 'type'],
    collinearity_statistics=True
)

# this is how you can predict values of the dependent variable for the given data... 
predictions = regr.predict()

# ... save various types of residuals ...
residuals = regr.save_residuals(unstandardized=False, studentized=True)

# ... and even save values of independent variables 
# if you didn't create them manually (e.g. dummies and interactions) ...
indep_vars = regr.save_independent_variables()
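
The dummies and interactions mentioned above are ordinary derived columns. As a rough illustration (not randan's code, and randan's choice of reference category may differ), the sketch below builds 0/1 dummy variables for a categorical variable, dropping the first category in sorted order as the reference, and forms an interaction as an element-wise product. All names and data are hypothetical.

```python
def make_dummies(values, prefix):
    """Build 0/1 columns for every category except the reference one."""
    categories = sorted(set(values))
    # assumption: the first category in sorted order is the reference
    _reference, *others = categories
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in others
    }


def interaction(column_a, column_b):
    """Element-wise product of two numeric columns."""
    return [a * b for a, b in zip(column_a, column_b)]


genre = ["drama", "comedy", "drama", "action"]
is_series = [1, 0, 0, 1]

genre_dummies = make_dummies(genre, "genre")
drama_x_series = interaction(genre_dummies["genre_drama"], is_series)
```

With three genres this yields two dummy columns, genre_comedy and genre_drama, while "action" is represented by zeros in both.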

Module `tree`

This module includes various methods for building decision trees. If you have a categorical dependent variable, use the classes with Classifier in their names. If you have a scale dependent variable, use the classes with Regressor in their names.

This group of methods belongs to supervised learning, which means you should use the fit function after calling the appropriate class, and then, if necessary, the predict function to acquire predictions.

from randan.tree import CHAIDRegressor

# with this code, you will immediately see the results, including the plot of your tree
chaid = CHAIDRegressor().fit(
    data,
    dependent_variable='kinopoisk_rate',
    independent_variables=['genre', 'age_ord', 'year', 'time', 'type', 'kinopoisk_rate_count'],
    scale_variables=['year', 'time', 'kinopoisk_rate_count'],
    ordinal_variables=['age_ord']
)

# this is how you can predict values of the dependent variable, the node membership, 
# and the description of the node in terms of interactions for the given data 
predictions = chaid.predict(node=True, interaction=True)
