def mediation_analysis(data=None, x=None, m=None, y=None, covar=None,
                       alpha=0.05, n_boot=500, seed=None, return_dist=False):
    """Mediation analysis using a bias-corrected non-parametric bootstrap method.

    Parameters
    ----------
    data : :py:class:`pandas.DataFrame`
        Dataframe.
    x : str
        Column name in data containing the predictor variable.
        The predictor variable must be continuous.
    m : str or list of str
        Column name(s) in data containing the mediator variable(s).
        The mediator(s) can be continuous or binary (e.g. 0 or 1).
        This function supports multiple parallel mediators.
    y : str
        Column name in data containing the outcome variable.
        The outcome variable must be continuous.
    covar : None, str, or list
        Covariate(s). If not None, the specified covariate(s) will be included
        in all regressions.
    alpha : float
        Significance threshold. Used to determine the confidence interval,
        :math:`\\text{CI} = [\\alpha / 2 ; 1 - \\alpha / 2]`.
    n_boot : int
        Number of bootstrap iterations used to estimate the confidence
        intervals and p-values. The greater, the slower.
    seed : int or None
        Random state seed.
    return_dist : bool
        If True, the function also returns the indirect bootstrapped beta
        samples (size = n_boot). Can be plotted for instance using
        :py:func:`seaborn.distplot()` or :py:func:`seaborn.kdeplot()`
        functions.

    Returns
    -------
    stats : :py:class:`pandas.DataFrame`
        Mediation summary:

        * ``'path'``: regression model
        * ``'coef'``: regression estimates
        * ``'se'``: standard error
        * ``'CI[2.5%]'``: lower confidence interval
        * ``'CI[97.5%]'``: upper confidence interval
        * ``'pval'``: two-sided p-values
        * ``'sig'``: statistical significance

    See also
    --------
    linear_regression, logistic_regression

    Notes
    -----
    Mediation analysis [1]_ is a *"statistical procedure to test
    whether the effect of an independent variable X on a dependent variable
    Y (i.e., X → Y) is at least partly explained by a chain of effects of the
    independent variable on an intervening mediator variable M and of the
    intervening variable on the dependent variable (i.e., X → M → Y)"* [2]_.

    The **indirect effect** (also referred to as average causal mediation
    effect or ACME) of X on Y through mediator M quantifies the estimated
    difference in Y resulting from a one-unit change in X through a sequence of
    causal steps in which X affects M, which in turn affects Y.
    It is considered significant if the specified confidence interval does not
    include 0. The path 'X --> Y' is the sum of both the indirect and direct
    effect. It is sometimes referred to as total effect.

    A linear regression is used if the mediator variable is continuous and a
    logistic regression if the mediator variable is dichotomous (binary).
    Multiple parallel mediators are also supported.

    This function will only work well if the outcome variable is continuous.
    It does not support binary or ordinal outcome variables. For more
    advanced mediation models, please refer to the
    `lavaan <http://lavaan.ugent.be/tutorial/mediation.html>`_ or `mediation
    <https://cran.r-project.org/web/packages/mediation/mediation.pdf>`_ R
    packages, or the `PROCESS macro
    <https://www.processmacro.org/index.html>`_ for SPSS.

    The two-sided p-value of the indirect effect is computed using the
    bootstrap distribution, as in the mediation R package. However, the p-value
    should be interpreted with caution since it is not constructed
    conditioned on a true null hypothesis [3]_ and varies depending on the
    number of bootstrap samples and the random seed.

    Note that rows with missing values are automatically removed.

    Results have been tested against the R mediation package and this tutorial
    https://data.library.virginia.edu/introduction-to-mediation-analysis/

    References
    ----------
    .. [1] Baron, R. M. & Kenny, D. A. The moderator–mediator variable
           distinction in social psychological research: Conceptual, strategic,
           and statistical considerations. J. Pers. Soc. Psychol. 51, 1173–1182
           (1986).

    .. [2] Fiedler, K., Schott, M. & Meiser, T. What mediation analysis can
           (not) do. J. Exp. Soc. Psychol. 47, 1231–1236 (2011).

    .. [3] Hayes, A. F. & Rockwood, N. J. Regression-based statistical
           mediation and moderation analysis in clinical research:
           Observations, recommendations, and implementation. Behav. Res.
           Ther. 98, 39–57 (2017).

    Code originally adapted from https://github.com/rmill040/pymediation.

    Examples
    --------
    1. Simple mediation analysis

    >>> from pingouin import mediation_analysis, read_dataset
    >>> df = read_dataset('mediation')
    >>> mediation_analysis(data=df, x='X', m='M', y='Y', alpha=0.05,
    ...                    seed=42)
           path      coef        se          pval  CI[2.5%]  CI[97.5%]  sig
    0     M ~ X  0.561015  0.094480  4.391362e-08  0.373522   0.748509  Yes
    1     Y ~ M  0.654173  0.085831  1.612674e-11  0.483844   0.824501  Yes
    2     Total  0.396126  0.111160  5.671128e-04  0.175533   0.616719  Yes
    3    Direct  0.039604  0.109648  7.187429e-01 -0.178018   0.257226   No
    4  Indirect  0.356522  0.083313  0.000000e+00  0.219818   0.537654  Yes

    2. Return the indirect bootstrapped beta coefficients

    >>> stats, dist = mediation_analysis(data=df, x='X', m='M', y='Y',
    ...                                  return_dist=True)
    >>> print(dist.shape)
    (500,)

    3. Mediation analysis with a binary mediator variable

    >>> mediation_analysis(data=df, x='X', m='Mbin', y='Y', seed=42).round(3)
           path   coef     se   pval  CI[2.5%]  CI[97.5%]  sig
    0  Mbin ~ X -0.021  0.116  0.857    -0.248      0.206   No
    1  Y ~ Mbin -0.135  0.412  0.743    -0.952      0.682   No
    2     Total  0.396  0.111  0.001     0.176      0.617  Yes
    3    Direct  0.396  0.112  0.001     0.174      0.617  Yes
    4  Indirect  0.002  0.050  0.960    -0.072      0.146   No

    4. Mediation analysis with covariates

    >>> mediation_analysis(data=df, x='X', m='M', y='Y',
    ...                    covar=['Mbin', 'Ybin'], seed=42).round(3)
           path   coef     se   pval  CI[2.5%]  CI[97.5%]  sig
    0     M ~ X  0.559  0.097  0.000     0.367      0.752  Yes
    1     Y ~ M  0.666  0.086  0.000     0.495      0.837  Yes
    2     Total  0.420  0.113  0.000     0.196      0.645  Yes
    3    Direct  0.064  0.110  0.561    -0.155      0.284   No
    4  Indirect  0.356  0.086  0.000     0.209      0.553  Yes

    5. Mediation analysis with multiple parallel mediators

    >>> mediation_analysis(data=df, x='X', m=['M', 'Mbin'], y='Y',
    ...                    seed=42).round(3)
                path   coef     se   pval  CI[2.5%]  CI[97.5%]  sig
    0          M ~ X  0.561  0.094  0.000     0.374      0.749  Yes
    1       Mbin ~ X -0.005  0.029  0.859    -0.063      0.052   No
    2          Y ~ M  0.654  0.086  0.000     0.482      0.825  Yes
    3       Y ~ Mbin -0.064  0.328  0.846    -0.715      0.587   No
    4          Total  0.396  0.111  0.001     0.176      0.617  Yes
    5         Direct  0.040  0.110  0.721    -0.179      0.258   No
    6     Indirect M  0.356  0.085  0.000     0.215      0.538  Yes
    7  Indirect Mbin  0.000  0.010  0.952    -0.017      0.025   No
    """
    # Sanity check
    assert isinstance(x, str), 'x must be a string.'
    assert isinstance(y, str), 'y must be a string.'
    assert isinstance(m, (list, str)), 'Mediator(s) must be a list or string.'
    assert isinstance(covar, (type(None), str, list))
    if isinstance(m, str):
        m = [m]
    n_mediator = len(m)
    assert isinstance(data, pd.DataFrame), 'Data must be a DataFrame.'
    # Check for duplicates
    assert n_mediator == len(set(m)), 'Cannot have duplicate mediators.'
    if isinstance(covar, str):
        covar = [covar]
    if isinstance(covar, list):
        assert len(covar) == len(set(covar)), 'Cannot have duplicate covar.'
        assert set(m).isdisjoint(covar), 'Mediator cannot be in covar.'
    # Check that columns are in dataframe
    columns = _fl([x, m, y, covar])
    keys = data.columns
    assert all([c in keys for c in columns]), 'Column(s) are not in DataFrame.'
    # Check that columns are numeric
    err_msg = "Columns must be numeric or boolean."
    assert all([data[c].dtype.kind in 'bfiu' for c in columns]), err_msg

    # Drop rows with NaN values
    data = data[columns].dropna()
    n = data.shape[0]
    # The message matches the actual condition (n > 5)
    assert n > 5, 'DataFrame must have more than 5 samples (rows).'

    # Check if mediator is binary
    mtype = 'logistic' if all(data[m].nunique() == 2) else 'linear'

    # Name of CI
    ll_name = 'CI[%.1f%%]' % (100 * alpha / 2)
    ul_name = 'CI[%.1f%%]' % (100 * (1 - alpha / 2))

    # Compute regressions
    cols = ['names', 'coef', 'se', 'pval', ll_name, ul_name]

    # For speed, we pass np.array instead of pandas DataFrame
    X_val = data[_fl([x, covar])].to_numpy()  # X + covar as predictors
    XM_val = data[_fl([x, m, covar])].to_numpy()  # X + M + covar as predictors
    M_val = data[m].to_numpy()  # M as target (no covariates)
    y_val = data[y].to_numpy()  # y as target (no covariates)

    # For max precision, make sure rounding is disabled
    old_options = options.copy()
    options['round'] = None

    # M(j) ~ X + covar
    sxm = {}
    for idx, j in enumerate(m):
        if mtype == 'linear':
            sxm[j] = linear_regression(X_val, M_val[:, idx],
                                       alpha=alpha).loc[[1], cols]
        else:
            sxm[j] = logistic_regression(X_val, M_val[:, idx],
                                         alpha=alpha).loc[[1], cols]
        sxm[j].at[1, 'names'] = '%s ~ X' % j
    sxm = pd.concat(sxm, ignore_index=True)

    # Y ~ M + covar
    smy = linear_regression(data[_fl([m, covar])], y_val,
                            alpha=alpha).loc[1:n_mediator, cols]

    # Average Total Effects (Y ~ X + covar)
    sxy = linear_regression(X_val, y_val, alpha=alpha).loc[[1], cols]

    # Average Direct Effects (Y ~ X + M + covar)
    direct = linear_regression(XM_val, y_val, alpha=alpha).loc[[1], cols]

    # Rename paths
    smy['names'] = smy['names'].apply(lambda x: 'Y ~ %s' % x)
    direct.at[1, 'names'] = 'Direct'
    sxy.at[1, 'names'] = 'Total'

    # Concatenate and create sig column
    stats = pd.concat((sxm, smy, sxy, direct), ignore_index=True)
    stats['sig'] = np.where(stats['pval'] < alpha, 'Yes', 'No')

    # Bootstrap confidence intervals
    rng = np.random.RandomState(seed)
    idx = rng.choice(np.arange(n), replace=True, size=(n_boot, n))
    ab_estimates = np.zeros(shape=(n_boot, n_mediator))
    for i in range(n_boot):
        ab_estimates[i, :] = _point_estimate(X_val, XM_val, M_val, y_val,
                                             idx[i, :], n_mediator, mtype)

    ab = _point_estimate(X_val, XM_val, M_val, y_val, np.arange(n),
                         n_mediator, mtype)
    indirect = {'names': m, 'coef': ab,
                'se': ab_estimates.std(ddof=1, axis=0),
                'pval': [], ll_name: [], ul_name: [], 'sig': []}

    for j in range(n_mediator):
        ci_j = _bca(ab_estimates[:, j], indirect['coef'][j],
                    alpha=alpha, n_boot=n_boot)
        indirect[ll_name].append(min(ci_j))
        indirect[ul_name].append(max(ci_j))
        # Bootstrapped p-value of indirect effect
        # Note that this is less accurate than a permutation test because the
        # bootstrap distribution is not conditioned on a true null hypothesis.
        # For more details see Hayes and Rockwood 2017
        indirect['pval'].append(_pval_from_bootci(ab_estimates[:, j],
                                                  indirect['coef'][j]))
        indirect['sig'].append('Yes' if indirect['pval'][j] < alpha else 'No')

    # Create output dataframe
    indirect = pd.DataFrame.from_dict(indirect)
    if n_mediator == 1:
        indirect['names'] = 'Indirect'
    else:
        indirect['names'] = indirect['names'].apply(lambda x:
                                                    'Indirect %s' % x)
    # DataFrame.append was removed in pandas 2.0; pd.concat is equivalent here
    stats = pd.concat([stats, indirect], ignore_index=True)
    stats = stats.rename(columns={'names': 'path'})

    # Restore options
    options.update(old_options)

    if return_dist:
        return _postprocess_dataframe(stats), np.squeeze(ab_estimates)
    else:
        return _postprocess_dataframe(stats)

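
# Illustration only (not part of the module): a minimal standard-library sketch
# of how a bootstrap confidence interval for the indirect effect a * b can be
# obtained. It assumes a single continuous mediator and no covariates, and it
# uses a plain percentile interval rather than the bias-corrected interval
# computed by mediation_analysis. The helper names (_cov, indirect_effect,
# percentile_bootstrap_ci) and the simulated data are hypothetical.

```python
import random
import statistics


def _cov(u, v):
    """Sample covariance of two equal-length sequences."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def indirect_effect(x, m, y):
    """Return a * b, where a is the OLS slope of M ~ X and b is the
    partial OLS slope of M in Y ~ X + M."""
    sxx, smm = _cov(x, x), _cov(m, m)
    sxm, sxy, smy = _cov(x, m), _cov(x, y), _cov(m, y)
    a = sxm / sxx
    # Closed-form solution of the two-predictor normal equations
    b = (smy * sxx - sxy * sxm) / (sxx * smm - sxm ** 2)
    return a * b


def percentile_bootstrap_ci(x, m, y, n_boot=500, alpha=0.05, seed=None):
    """Percentile bootstrap CI of the indirect effect (rows resampled
    with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(indirect_effect([x[i] for i in idx],
                                         [m[i] for i in idx],
                                         [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Simulated data with a true indirect effect of 0.5 * 0.7 = 0.35
_rng = random.Random(0)
x = [_rng.gauss(0, 1) for _ in range(200)]
m = [0.5 * xi + _rng.gauss(0, 0.5) for xi in x]
y = [0.7 * mi + _rng.gauss(0, 0.5) for mi in m]

point = indirect_effect(x, m, y)
lower, upper = percentile_bootstrap_ci(x, m, y, n_boot=500, seed=42)
print(point, lower, upper)
```

# In this sketch the indirect effect would be called significant when the
# interval [lower, upper] excludes 0, mirroring the rule given in the Notes.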
def rm_corr(data=None, x=None, y=None, subject=None, tail='two-sided'):
    """Repeated measures correlation.

    Parameters
    ----------
    data : :py:class:`pandas.DataFrame`
        Dataframe.
    x, y : string
        Name of columns in ``data`` containing the two dependent variables.
    subject : string
        Name of column in ``data`` containing the subject indicator.
    tail : string
        Specify whether to return 'one-sided' or 'two-sided' p-value.

    Returns
    -------
    stats : :py:class:`pandas.DataFrame`

        * ``'r'``: Repeated measures correlation coefficient
        * ``'dof'``: Degrees of freedom
        * ``'pval'``: one- or two-tailed p-value
        * ``'CI95%'``: 95% parametric confidence intervals
        * ``'power'``: achieved power of the test (= 1 - type II error).

    See also
    --------
    plot_rm_corr

    Notes
    -----
    Repeated measures correlation (rmcorr) is a statistical technique
    for determining the common within-individual association for paired
    measures assessed on two or more occasions for multiple individuals.

    From `Bakdash and Marusich (2017)
    <https://doi.org/10.3389/fpsyg.2017.00456>`_:

        *Rmcorr accounts for non-independence among observations using analysis
        of covariance (ANCOVA) to statistically adjust for inter-individual
        variability. By removing measured variance between-participants,
        rmcorr provides the best linear fit for each participant using parallel
        regression lines (the same slope) with varying intercepts.
        Like a Pearson correlation coefficient, the rmcorr coefficient
        is bounded by −1 to 1 and represents the strength of the linear
        association between two variables.*

    Results have been tested against the
    `rmcorr <https://github.com/cran/rmcorr>`_ R package.

    Please note that missing values are automatically removed from the
    dataframe (listwise deletion).

    Examples
    --------
    >>> import pingouin as pg
    >>> df = pg.read_dataset('rm_corr')
    >>> pg.rm_corr(data=df, x='pH', y='PacO2', subject='Subject')
                   r  dof      pval           CI95%     power
    rm_corr -0.50677   38  0.000847  [-0.71, -0.23]  0.929579

    Now plot using the :py:func:`pingouin.plot_rm_corr` function:

    .. plot::

        >>> import pingouin as pg
        >>> df = pg.read_dataset('rm_corr')
        >>> g = pg.plot_rm_corr(data=df, x='pH', y='PacO2', subject='Subject')
    """
    from pingouin import ancova, power_corr

    # Safety checks
    assert isinstance(data, pd.DataFrame), 'Data must be a DataFrame'
    assert x in data.columns, 'The %s column is not in data.' % x
    assert y in data.columns, 'The %s column is not in data.' % y
    assert data[x].dtype.kind in 'bfiu', '%s must be numeric.' % x
    assert data[y].dtype.kind in 'bfiu', '%s must be numeric.' % y
    assert subject in data.columns, 'The %s column is not in data.' % subject
    if data[subject].nunique() < 3:
        raise ValueError('rm_corr requires at least 3 unique subjects.')

    # Remove missing values
    data = data[[x, y, subject]].dropna(axis=0)

    # Using PINGOUIN
    # For max precision, make sure rounding is disabled
    old_options = options.copy()
    options['round'] = None
    aov = ancova(dv=y, covar=x, between=subject, data=data)
    options.update(old_options)  # restore options
    bw = aov.bw_  # Beta within parameter
    sign = np.sign(bw)
    dof = int(aov.at[2, 'DF'])
    n = dof + 2
    ssfactor = aov.at[1, 'SS']
    sserror = aov.at[2, 'SS']
    rm = sign * np.sqrt(ssfactor / (ssfactor + sserror))
    pval = aov.at[1, 'p-unc']
    pval = pval * 0.5 if tail == 'one-sided' else pval
    ci = compute_esci(stat=rm, nx=n, eftype='pearson').tolist()
    pwr = power_corr(r=rm, n=n, tail=tail)
    # Convert to Dataframe
    stats = pd.DataFrame({"r": rm, "dof": int(dof), "pval": pval,
                          "CI95%": [ci], "power": pwr}, index=["rm_corr"])
    return _postprocess_dataframe(stats)

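
# Illustration only (not part of the module): a minimal standard-library sketch
# of the quantity that rm_corr estimates. With one common slope and a separate
# intercept per subject, sign(b) * sqrt(SS_x / (SS_x + SS_error)) from the
# ANCOVA reduces to the Pearson correlation of the subject-mean-centered x and
# y values, with N - k - 1 degrees of freedom (N observations, k subjects).
# The function name rm_corr_sketch and the toy data are hypothetical, and no
# p-value, confidence interval or power is computed here.

```python
import math
from collections import defaultdict


def rm_corr_sketch(x, y, subject):
    """Return (r, dof) of the repeated measures correlation."""
    groups = defaultdict(list)
    for xi, yi, si in zip(x, y, subject):
        groups[si].append((xi, yi))

    sxx = syy = sxy = 0.0
    for pairs in groups.values():
        # Center each subject's values on that subject's own means
        mx = sum(p[0] for p in pairs) / len(pairs)
        my = sum(p[1] for p in pairs) / len(pairs)
        for xi, yi in pairs:
            sxx += (xi - mx) ** 2
            syy += (yi - my) ** 2
            sxy += (xi - mx) * (yi - my)

    r = sxy / math.sqrt(sxx * syy)
    dof = len(x) - len(groups) - 1
    return r, dof


# Three subjects with different intercepts and the same slope of -2
x = [1, 2, 3] * 3
subject = ['a'] * 3 + ['b'] * 3 + ['c'] * 3
intercepts = {'a': 10, 'b': 20, 'c': 30}
y = [intercepts[s] - 2 * xi for xi, s in zip(x, subject)]

r, dof = rm_corr_sketch(x, y, subject)
print(r, dof)
```

# Because every subject follows the same exact line up to its intercept, the
# within-subject association in this toy data is perfectly negative.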
def intraclass_corr(data=None, targets=None, raters=None, ratings=None,
                    nan_policy='raise'):
    """Intraclass correlation.

    Parameters
    ----------
    data : :py:class:`pandas.DataFrame`
        Long-format dataframe. Data must be fully balanced.
    targets : string
        Name of column in ``data`` containing the targets.
    raters : string
        Name of column in ``data`` containing the raters.
    ratings : string
        Name of column in ``data`` containing the ratings.
    nan_policy : str
        Defines how to handle missing values (nan) in the input.
        `'raise'` (default) throws an error, `'omit'` performs the calculations
        after deleting target(s) with one or more missing values (= listwise
        deletion).

        .. versionadded:: 0.3.0

    Returns
    -------
    stats : :py:class:`pandas.DataFrame`
        Output dataframe:

        * ``'Type'``: ICC type
        * ``'Description'``: description of the ICC
        * ``'ICC'``: intraclass correlation
        * ``'F'``: F statistic
        * ``'df1'``: numerator degree of freedom
        * ``'df2'``: denominator degree of freedom
        * ``'pval'``: p-value
        * ``'CI95%'``: 95% confidence intervals around the ICC

    Notes
    -----
    The intraclass correlation (ICC, [1]_) assesses the reliability of ratings
    by comparing the variability of different ratings of the same subject to
    the total variation across all ratings and all subjects.

    Shrout and Fleiss (1979) [2]_ describe six cases of reliability of ratings
    done by :math:`k` raters on :math:`n` targets. Pingouin returns all six
    cases with corresponding F and p-values, as well as 95% confidence
    intervals.

    From the documentation of the ICC function in the `psych
    <https://cran.r-project.org/web/packages/psych/psych.pdf>`_ R package:

    - **ICC1**: Each target is rated by a different rater and the raters are
      selected at random. This is a one-way ANOVA fixed effects model.
      ICC1 is sensitive to differences in means between raters and is a
      measure of absolute agreement.

    - **ICC2**: A random sample of :math:`k` raters rate each target.
      The measure is one of absolute agreement in the ratings.

    - **ICC3**: A fixed set of :math:`k` raters rate each target. There is no
      generalization to a larger population of raters.
      ICC2 and ICC3 remove mean differences between raters, but are sensitive
      to interactions. The difference between ICC2 and ICC3 is whether raters
      are seen as fixed or random effects.

    Then, for each of these cases, the reliability can either be estimated for
    a single rating or for the average of :math:`k` ratings. The 1 rating case
    is equivalent to the average intercorrelation, while the :math:`k` rating
    case is equivalent to the Spearman-Brown adjusted reliability.
    **ICC1k**, **ICC2k**, **ICC3k** reflect the means of :math:`k` raters.

    This function has been tested against the ICC function of the R psych
    package. Note however that contrary to the R implementation, the
    current implementation does not use linear mixed-effects models but a
    regular ANOVA, which means that it only works with complete-case data
    (no missing values).

    References
    ----------
    .. [1] http://www.real-statistics.com/reliability/intraclass-correlation/

    .. [2] Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: uses
           in assessing rater reliability. Psychological bulletin, 86(2), 420.

    Examples
    --------
    ICCs of wine quality assessed by 4 judges.

    >>> import pingouin as pg
    >>> data = pg.read_dataset('icc')
    >>> icc = pg.intraclass_corr(data=data, targets='Wine', raters='Judge',
    ...                          ratings='Scores').round(3)
    >>> icc.set_index("Type")
                   Description    ICC       F  df1  df2  pval         CI95%
    Type
    ICC1    Single raters absolute  0.728  11.680    7   24   0.0  [0.43, 0.93]
    ICC2      Single random raters  0.728  11.787    7   21   0.0  [0.43, 0.93]
    ICC3       Single fixed raters  0.729  11.787    7   21   0.0  [0.43, 0.93]
    ICC1k  Average raters absolute  0.914  11.680    7   24   0.0  [0.75, 0.98]
    ICC2k    Average random raters  0.914  11.787    7   21   0.0  [0.75, 0.98]
    ICC3k     Average fixed raters  0.915  11.787    7   21   0.0  [0.75, 0.98]
    """
    from pingouin import anova

    # Safety check
    assert isinstance(data, pd.DataFrame), 'data must be a dataframe.'
    assert all([v is not None for v in [targets, raters, ratings]])
    assert all([v in data.columns for v in [targets, raters, ratings]])
    assert nan_policy in ['omit', 'raise']

    # Convert data to wide-format
    data = data.pivot_table(index=targets, columns=raters, values=ratings)

    # Listwise deletion of missing values
    nan_present = data.isna().any().any()
    if nan_present:
        if nan_policy == 'omit':
            data = data.dropna(axis=0, how='any')
        else:
            raise ValueError("Either missing values are present in data or "
                             "data are unbalanced. Please remove them "
                             "manually or use nan_policy='omit'.")

    # Back to long-format
    # data_wide = data.copy()  # Optional, for PCA
    data = data.reset_index().melt(id_vars=targets, value_name=ratings)

    # Check that ratings is a numeric variable
    assert data[ratings].dtype.kind in 'bfiu', 'Ratings must be numeric.'
    # Check that data are fully balanced
    # This behavior is ensured by the long-to-wide-to-long transformation
    # Unbalanced data will result in rows with missing values.
    # assert data.groupby(raters)[ratings].count().nunique() == 1

    # Extract sizes
    k = data[raters].nunique()
    n = data[targets].nunique()

    # Two-way ANOVA
    with np.errstate(invalid='ignore'):
        # For max precision, make sure rounding is disabled
        old_options = options.copy()
        options['round'] = None
        aov = anova(data=data, dv=ratings, between=[targets, raters],
                    ss_type=2)
        options.update(old_options)  # restore options

    # Extract mean squares
    msb = aov.at[0, 'MS']
    msw = (aov.at[1, 'SS'] + aov.at[2, 'SS']) / (aov.at[1, 'DF'] +
                                                 aov.at[2, 'DF'])
    msj = aov.at[1, 'MS']
    mse = aov.at[2, 'MS']

    # Calculate ICCs
    icc1 = (msb - msw) / (msb + (k - 1) * msw)
    icc2 = (msb - mse) / (msb + (k - 1) * mse + k * (msj - mse) / n)
    icc3 = (msb - mse) / (msb + (k - 1) * mse)
    icc1k = (msb - msw) / msb
    icc2k = (msb - mse) / (msb + (msj - mse) / n)
    icc3k = (msb - mse) / msb

    # Calculate F, df, and p-values
    f1k = msb / msw
    df1 = n - 1
    df1kd = n * (k - 1)
    p1k = f.sf(f1k, df1, df1kd)

    f2k = f3k = msb / mse
    df2kd = (n - 1) * (k - 1)
    p2k = f.sf(f2k, df1, df2kd)

    # Create output dataframe
    stats = {
        'Type': ['ICC1', 'ICC2', 'ICC3', 'ICC1k', 'ICC2k', 'ICC3k'],
        'Description': [
            'Single raters absolute', 'Single random raters',
            'Single fixed raters', 'Average raters absolute',
            'Average random raters', 'Average fixed raters'
        ],
        'ICC': [icc1, icc2, icc3, icc1k, icc2k, icc3k],
        'F': [f1k, f2k, f2k, f1k, f2k, f2k],
        'df1': n - 1,
        'df2': [df1kd, df2kd, df2kd, df1kd, df2kd, df2kd],
        'pval': [p1k, p2k, p2k, p1k, p2k, p2k]
    }

    stats = pd.DataFrame(stats)

    # Calculate confidence intervals
    alpha = 0.05
    # Case 1 and 3
    f1l = f1k / f.ppf(1 - alpha / 2, df1, df1kd)
    f1u = f1k * f.ppf(1 - alpha / 2, df1kd, df1)
    l1 = (f1l - 1) / (f1l + (k - 1))
    u1 = (f1u - 1) / (f1u + (k - 1))
    f3l = f3k / f.ppf(1 - alpha / 2, df1, df2kd)
    f3u = f3k * f.ppf(1 - alpha / 2, df2kd, df1)
    l3 = (f3l - 1) / (f3l + (k - 1))
    u3 = (f3u - 1) / (f3u + (k - 1))
    # Case 2
    fj = msj / mse
    vn = df2kd * ((k * icc2 * fj + n * (1 + (k - 1) * icc2) - k * icc2))**2
    vd = df1 * k**2 * icc2**2 * fj**2 + \
        (n * (1 + (k - 1) * icc2) - k * icc2)**2
    v = vn / vd
    f2u = f.ppf(1 - alpha / 2, n - 1, v)
    f2l = f.ppf(1 - alpha / 2, v, n - 1)
    l2 = n * (msb - f2u * mse) / (f2u * (k * msj + (k * n - k - n) * mse)
                                  + n * msb)
    u2 = n * (f2l * msb - mse) / (k * msj + (k * n - k - n) * mse
                                  + n * f2l * msb)

    stats['CI95%'] = [
        np.array([l1, u1]),
        np.array([l2, u2]),
        np.array([l3, u3]),
        np.array([1 - 1 / f1l, 1 - 1 / f1u]),
        np.array([l2 * k / (1 + l2 * (k - 1)), u2 * k / (1 + u2 * (k - 1))]),
        np.array([1 - 1 / f3l, 1 - 1 / f3u])
    ]

    return _postprocess_dataframe(stats)
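
# Illustration only (not part of the module): a minimal standard-library sketch
# of how the six ICC point estimates follow from the mean squares of a balanced
# two-way layout, using the same formulas as intraclass_corr. It assumes a
# complete wide-format table (one row per target, one column per rater) and
# leaves out the F-tests and confidence intervals. The function name
# icc_from_wide and the toy ratings are hypothetical.

```python
def icc_from_wide(ratings):
    """Return the six ICC point estimates for an n x k list of lists."""
    n = len(ratings)      # number of targets
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares of the two-way decomposition
    ss_targets = k * sum((rm - grand) ** 2 for rm in row_means)
    ss_raters = n * sum((cm - grand) ** 2 for cm in col_means)
    ss_error = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k))

    # Mean squares: between targets, between raters, residual, within targets
    msb = ss_targets / (n - 1)
    msj = ss_raters / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    msw = (ss_raters + ss_error) / (n * (k - 1))

    return {
        'ICC1': (msb - msw) / (msb + (k - 1) * msw),
        'ICC2': (msb - mse) / (msb + (k - 1) * mse + k * (msj - mse) / n),
        'ICC3': (msb - mse) / (msb + (k - 1) * mse),
        'ICC1k': (msb - msw) / msb,
        'ICC2k': (msb - mse) / (msb + (msj - mse) / n),
        'ICC3k': (msb - mse) / msb,
    }


# Two raters, three targets; rater 2 always scores one point higher
icc = icc_from_wide([[1, 2], [3, 4], [5, 6]])
print(icc)
```

# In this toy table the raters differ only by a constant offset, so ICC3
# (which removes mean differences between raters) equals 1, while ICC1 and
# ICC2 (absolute agreement) are lower.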