The Python `SomaData` Package from Somalogic, Inc.

Overview

This document accompanies the Python package somadata, which loads the SomaLogic, Inc. structured text data file called an *.adat. The somadata.Adat object is an extension of the pandas.DataFrame class. The package provides auxiliary functions for extracting relevant information from the ADAT object once in the Python environment. Basic familiarity with the Python environment is assumed, as is the ability to install contributed packages from the Python Package Installer (pip)

Installation

The easiest way to install SomaData is to install directly from PyPI

PIP:

pip install SomaData

Alternatively one can install from the GitHub repository.

GitHub:

pip install git+https://github.com/SomaLogic/Canopy.git#egg=somadata

Alternatively, if you wish to develop or change the source code, you may clone the repository and install manually via:

git clone https://github.com/SomaLogic/Canopy.git
pip install -e ./somadata

Dependencies

Python >=3.8 is required to install somadata. The following package dependencies are installed on a pip install:

pandas >= 1.1.0
numpy >= 1.19.1

return to top

Basics

Upon installation, load somadata as normal:

return to top

import somadata

For a traversable index of the library:

help(somadata)
# help(somadata.adat) ... etc

Help on package somadata:

NAME
    somadata

PACKAGE CONTENTS
    adat
    annotations
    base (package)
    data (package)
    errors
    io (package)
    tools (package)

FILE
    /Users/tjohnson/code/repos/SomaData/somadata/__init__.py

Internal Objects

The somadata package comes with one internal object available to users to run canned examples (or analyses). It can be accessed by perform the import:

from somadata.data.example_data import example_data

Main Features (I/O)

Loading data (Import)
- Import a text file in the *.adat format into a Python session as an adat object.
Wrangling data (Manipulation)
- Subset, reorder, and list various fields of an adat object.
Exporting data (Output)
- Write out an adat object as a *.adat text file.

Loading an ADAT

return to top

Loading the sample file from within the somadata library via its path

adat = somadata.read_adat('./somadata/data/example_data.adat')
type(adat)

somadata.adat.Adat

adat.shape

(192, 5284)

adat.columns

MultiIndex([( '10000-28', '3', 'SL019233', ...),
            (  '10001-7', '3', 'SL002564', ...),
            ( '10003-15', '3', 'SL019245', ...),
            ( '10006-25', '3', 'SL019228', ...),
            ( '10008-43', '3', 'SL019234', ...),
            ( '10011-65', '3', 'SL019246', ...),
            (  '10012-5', '3', 'SL014669', ...),
            ( '10013-34', '3', 'SL025418', ...),
            ( '10014-31', '3', 'SL007803', ...),
            ('10015-119', '3', 'SL014924', ...),
            ...
            (  '9981-18', '3', 'SL018293', ...),
            (  '9983-97', '3', 'SL019202', ...),
            (  '9984-12', '3', 'SL019205', ...),
            (  '9986-14', '3', 'SL005356', ...),
            (  '9989-12', '3', 'SL019194', ...),
            (  '9993-11', '3', 'SL019212', ...),
            ( '9994-217', '3', 'SL019217', ...),
            (   '9995-6', '3', 'SL013164', ...),
            (  '9997-12', '3', 'SL019215', ...),
            (   '9999-1', '3', 'SL019231', ...)],
           names=['SeqId', 'SeqIdVersion', 'SomaId', 'TargetFullName', 'Target', 'UniProt', 'EntrezGeneID', 'EntrezGeneSymbol', 'Organism', 'Units', 'Type', 'Dilution', 'PlateScale_Reference', 'CalReference', 'Cal_Example_Adat_Set001', 'ColCheck', 'CalQcRatio_Example_Adat_Set001_170255', 'QcReference_170255', 'Cal_Example_Adat_Set002', 'CalQcRatio_Example_Adat_Set002_170255'], length=5284)

from IPython.display import HTML
#Display the first five rows and columns of the adat
HTML(adat.iloc[:5,:5].to_html()) # Need to use HTML & to_html() here to display nicely for this README
# Output is left-right scrollable in both this readme and Jupyter notebooks

																																	SeqId	10000-28	10001-7	10003-15	10006-25	10008-43
																																	SeqIdVersion	3	3	3	3	3
																																	SomaId	SL019233	SL002564	SL019245	SL019228	SL019234
																																	TargetFullName	Beta-crystallin B2	RAF proto-oncogene serine/threonine-protein kinase	Zinc finger protein 41	ETS domain-containing protein Elk-1	Guanylyl cyclase-activating protein 1
																																	Target	CRBB2	c-Raf	ZNF41	ELK1	GUC1A
																																	UniProt	P43320	P04049	P51814	P19419	P43080
																																	EntrezGeneID	1415	5894	7592	2002	2978
																																	EntrezGeneSymbol	CRYBB2	RAF1	ZNF41	ELK1	GUCA1A
																																	Organism	Human	Human	Human	Human	Human
																																	Units	RFU	RFU	RFU	RFU	RFU
																																	Type	Protein	Protein	Protein	Protein	Protein
																																	Dilution	20	20	0.5	20	20
																																	PlateScale_Reference	687.4	227.8	126.9	634.2	585.0
																																	CalReference	687.4	227.8	126.9	634.2	585.0
																																	Cal_Example_Adat_Set001	1.01252025	1.01605709	0.95056180	0.99607350	0.94051447
																																	ColCheck	PASS	PASS	PASS	PASS	PASS
																																	CalQcRatio_Example_Adat_Set001_170255	1.008	0.970	1.046	1.042	1.036
																																	QcReference_170255	505.4	223.9	119.6	667.2	587.5
																																	Cal_Example_Adat_Set002	1.01476233	1.03686846	1.15258856	0.93581231	0.96201283
																																	CalQcRatio_Example_Adat_Set002_170255	1.067	1.007	0.981	1.026	0.998
PlateId	PlateRunDate	ScannerID	PlatePosition	SlideId	Subarray	SampleId	SampleType	PercentDilution	SampleMatrix	Barcode	Barcode2d	SampleName	SampleNotes	AliquotingNotes	SampleDescription	AssayNotes	TimePoint	ExtIdentifier	SsfExtId	SampleGroup	SiteId	TubeUniqueID	CLI	HybControlNormScale	RowCheck	NormScale_20	NormScale_0_005	NormScale_0_5	ANMLFractionUsed_20	ANMLFractionUsed_0_005	ANMLFractionUsed_0_5	Age	Sex
Example Adat Set001	2020-06-18	SG15214400	H9	258495800012	3	1	Sample	20	Plasma-PPT															0.98185998	PASS	1.03693580	0.85701624	0.77717491	0.914	0.869	0.903	76	F	476.5	310.1	100.3	602.8	561.8
			H8	258495800004	7	2	Sample	20	Plasma-PPT															0.96671829	PASS	0.96022505	0.84858420	0.85201953	0.937	0.956	0.973	55	F	474.4	293.5	101.8	561.9	541.9
			H7	258495800010	8	3	Sample	20	Plasma-PPT															1.00193072	PASS	0.98411617	1.03270156	0.91519153	0.907	0.919	0.915	47	M	415.6	299.6	3030.1	563.9	423.9
			H6	258495800003	4	4	Sample	20	Plasma-PPT															0.94017961	PASS	1.07839878	0.94626841	0.91246731	0.934	0.919	0.912	37	M	442.6	247.9	112.9	563.7	469.8
			H5	258495800009	4	5	Sample	20	Plasma-PPT															0.94621098	PASS	0.84679446	0.92904553	0.77413056	0.707	0.894	0.708	71	F	465.7	710.7	95.9	791.0	443.5

You may also access the dict header metadata via:

adat.header_metadata

{'!AdatId': 'GID-1234-56-789-abcdef',
 '!Version': '1.2',
 '!AssayType': 'PharmaServices',
 '!AssayVersion': 'V4',
 '!AssayRobot': 'Fluent 1 L-307',
 '!Legal': 'Experiment details and data have been processed to protect Personally Identifiable Information (PII) and comply with existing privacy laws.',
 '!CreatedBy': 'PharmaServices',
 '!CreatedDate': '2020-07-24',
 '!EnteredBy': 'Technician1',
 '!ExpDate': '2020-06-18, 2020-07-20',
 '!GeneratedBy': 'Px (Build:  : ), Canopy_0.1.1',
 '!RunNotes': "2 columns ('Age' and 'Sex') have been added to this ADAT. Age has been randomly increased or decreased by 1-2 years to protect patient information",
 '!ProcessSteps': 'Raw RFU, Hyb Normalization, medNormInt (SampleId), plateScale, Calibration, anmlQC, qcCheck, anmlSMP',
 '!ProteinEffectiveDate': '2019-08-06',
 '!StudyMatrix': 'EDTA Plasma',
 '!PlateType': '',
 '!LabLocation': 'SLUS',
 '!StudyOrganism': '',
 '!Title': 'Example Adat Set001, Example Adat Set002',
 '!AssaySite': 'SW',
 '!CalibratorId': '170261',
 '!ReportConfig': {'analysisSteps': [{'stepType': 'hybNorm',
    'referenceSource': 'intraplate',
    'includeSampleTypes': ['QC', 'Calibrator', 'Buffer']},
   {'stepName': 'medNormInt',
    'stepType': 'medNorm',
    'includeSampleTypes': ['Calibrator', 'Buffer'],
    'referenceSource': 'intraplate',
    'referenceFields': ['SampleId']},
   {'stepType': 'plateScale',
    'referenceSource': 'Reference_v4_Plasma_Calibrator_170261'},
   {'stepType': 'calibrate',
    'referenceSource': 'Reference_v4_Plasma_Calibrator_170261'},
   {'stepName': 'anmlQC',
    'stepType': 'ANML',
    'effectSizeCutoff': 2.0,
    'minFractionUsed': 0.3,
    'includeSampleTypes': ['QC'],
    'referenceSource': 'Reference_v4_Plasma_ANML'},
   {'stepType': 'qcCheck',
    'QCReferenceSource': 'Reference_v4_Plasma_QC_ANML_170255',
    'tailsCriteriaLower': 0.8,
    'tailsCriteriaUpper': 1.2,
    'tailThreshold': 15.0,
    'QCAdditionalReferenceSources': ['Reference_v4_Plasma_QC_ANML_170259',
     'Reference_v4_Plasma_QC_ANML_170260'],
    'prenormalized': True},
   {'stepName': 'anmlSMP',
    'stepType': 'ANML',
    'effectSizeCutoff': 2.0,
    'minFractionUsed': 0.3,
    'includeSampleTypes': ['Sample'],
    'referenceSource': 'Reference_v4_Plasma_ANML'}],
  'qualityReports': ['SQS Report'],
  'filter': {'proteinEffectiveDate': '2019-08-06'}},
 'HybNormReference': 'intraplate',
 'MedNormReference': 'intraplate',
 'NormalizationAlgorithm': 'ANML',
 'PlateScale_ReferenceSource': 'Reference_v4_Plasma_Calibrator_170261',
 'PlateScale_Scalar_Example_Adat_Set001': '1.08091554',
 'PlateScale_PassFlag_Example_Adat_Set001': 'PASS',
 'CalibrationReference': 'Reference_v4_Plasma_Calibrator_170261',
 'CalPlateTailPercent_Example_Adat_Set001': '0.1',
 'PlateTailPercent_Example_Adat_Set001': '1.2',
 'PlateTailTest_Example_Adat_Set001': 'PASS',
 'PlateScale_Scalar_Example_Adat_Set002': '1.09915270',
 'PlateScale_PassFlag_Example_Adat_Set002': 'PASS',
 'CalPlateTailPercent_Example_Adat_Set002': '2.6',
 'PlateTailPercent_Example_Adat_Set002': '4.2',
 'PlateTailTest_Example_Adat_Set002': 'PASS'}

SomaData's Adat object inherits the pandas printing methods which displays nicely in Jupyter Notebooks when using IPython.display.display().

Wrangling

Dataframe columns Contain Feature Information

return to top

# Using the `adat` loaded from above
aptamer_df = adat.columns.to_frame(index=False)
type(aptamer_df)

pandas.core.frame.DataFrame

HTML(aptamer_df.head(5).to_html())

	SeqId	SeqIdVersion	SomaId	TargetFullName	Target	UniProt	EntrezGeneID	EntrezGeneSymbol	Organism	Units	Type	Dilution	PlateScale_Reference	CalReference	Cal_Example_Adat_Set001	ColCheck	CalQcRatio_Example_Adat_Set001_170255	QcReference_170255	Cal_Example_Adat_Set002	CalQcRatio_Example_Adat_Set002_170255
0	10000-28	3	SL019233	Beta-crystallin B2	CRBB2	P43320	1415	CRYBB2	Human	RFU	Protein	20	687.4	687.4	1.01252025	PASS	1.008	505.4	1.01476233	1.067
1	10001-7	3	SL002564	RAF proto-oncogene serine/threonine-protein kinase	c-Raf	P04049	5894	RAF1	Human	RFU	Protein	20	227.8	227.8	1.01605709	PASS	0.970	223.9	1.03686846	1.007
2	10003-15	3	SL019245	Zinc finger protein 41	ZNF41	P51814	7592	ZNF41	Human	RFU	Protein	0.5	126.9	126.9	0.95056180	PASS	1.046	119.6	1.15258856	0.981
3	10006-25	3	SL019228	ETS domain-containing protein Elk-1	ELK1	P19419	2002	ELK1	Human	RFU	Protein	20	634.2	634.2	0.99607350	PASS	1.042	667.2	0.93581231	1.026
4	10008-43	3	SL019234	Guanylyl cyclase-activating protein 1	GUC1A	P43080	2978	GUCA1A	Human	RFU	Protein	20	585.0	585.0	0.94051447	PASS	1.036	587.5	0.96201283	0.998

Accessing feature data

The .to_frame() method creates a lookup table that links the feature names in the adat object to the annotation data in columns:

col_df = adat.columns.to_frame(index=False)
type(col_df)

pandas.core.frame.DataFrame

HTML(col_df.head(5).to_html())

	SeqId	SeqIdVersion	SomaId	TargetFullName	Target	UniProt	EntrezGeneID	EntrezGeneSymbol	Organism	Units	Type	Dilution	PlateScale_Reference	CalReference	Cal_Example_Adat_Set001	ColCheck	CalQcRatio_Example_Adat_Set001_170255	QcReference_170255	Cal_Example_Adat_Set002	CalQcRatio_Example_Adat_Set002_170255
0	10000-28	3	SL019233	Beta-crystallin B2	CRBB2	P43320	1415	CRYBB2	Human	RFU	Protein	20	687.4	687.4	1.01252025	PASS	1.008	505.4	1.01476233	1.067
1	10001-7	3	SL002564	RAF proto-oncogene serine/threonine-protein kinase	c-Raf	P04049	5894	RAF1	Human	RFU	Protein	20	227.8	227.8	1.01605709	PASS	0.970	223.9	1.03686846	1.007
2	10003-15	3	SL019245	Zinc finger protein 41	ZNF41	P51814	7592	ZNF41	Human	RFU	Protein	0.5	126.9	126.9	0.95056180	PASS	1.046	119.6	1.15258856	0.981
3	10006-25	3	SL019228	ETS domain-containing protein Elk-1	ELK1	P19419	2002	ELK1	Human	RFU	Protein	20	634.2	634.2	0.99607350	PASS	1.042	667.2	0.93581231	1.026
4	10008-43	3	SL019234	Guanylyl cyclase-activating protein 1	GUC1A	P43080	2978	GUCA1A	Human	RFU	Protein	20	585.0	585.0	0.94051447	PASS	1.036	587.5	0.96201283	0.998

Display features

adat.columns.get_level_values('SeqId')[:20] # first 20 features

Index(['10000-28', '10001-7', '10003-15', '10006-25', '10008-43', '10011-65',
       '10012-5', '10013-34', '10014-31', '10015-119', '10021-1', '10022-207',
       '10023-32', '10024-44', '10030-8', '10034-16', '10035-6', '10036-201',
       '10037-98', '10040-63'],
      dtype='object', name='SeqId')

Get # Features

adat.shape[1]

Clinical Data

Dataframe index Contains Sample Information

# Using the `adat` loaded from above
sample_df = adat.index.to_frame(index=False)
type(sample_df)

pandas.core.frame.DataFrame

HTML(sample_df.head(5).to_html())

	PlateId	PlateRunDate	ScannerID	PlatePosition	SlideId	Subarray	SampleId	SampleType	PercentDilution	SampleMatrix	HybControlNormScale	RowCheck	NormScale_20	NormScale_0_005	NormScale_0_5	ANMLFractionUsed_20	ANMLFractionUsed_0_005	ANMLFractionUsed_0_5	Age	Sex
0	Example Adat Set001	2020-06-18	SG15214400	H9	258495800012	3	1	Sample	20	Plasma-PPT	0.98185998	PASS	1.03693580	0.85701624	0.77717491	0.914	0.869	0.903	76	F
1	Example Adat Set001	2020-06-18	SG15214400	H8	258495800004	7	2	Sample	20	Plasma-PPT	0.96671829	PASS	0.96022505	0.84858420	0.85201953	0.937	0.956	0.973	55	F
2	Example Adat Set001	2020-06-18	SG15214400	H7	258495800010	8	3	Sample	20	Plasma-PPT	1.00193072	PASS	0.98411617	1.03270156	0.91519153	0.907	0.919	0.915	47	M
3	Example Adat Set001	2020-06-18	SG15214400	H6	258495800003	4	4	Sample	20	Plasma-PPT	0.94017961	PASS	1.07839878	0.94626841	0.91246731	0.934	0.919	0.912	37	M
4	Example Adat Set001	2020-06-18	SG15214400	H5	258495800009	4	5	Sample	20	Plasma-PPT	0.94621098	PASS	0.84679446	0.92904553	0.77413056	0.707	0.894	0.708	71	F

Modifying Metadata

The Adat index and columns are pandas.MultiIndex objects. Several convenience functions exist to help you modify these objects. Typically, the row metadata (axis=0) represents data describing the sample or the individual from whom the sample was collected. The column metadata (axis=1) contains data regarding the SOMAmer reagent, the reagent's target and scalars applied to columns during normalization, these columns are not usually edited by the end user but can be using the same methods demonstrated on row metadata below.

return to top

Add Row Metadata

Row metadata is sample level information which could include added clinical data like age, sex or clinical measurements. The Adat class facilitates managing this data.

# using ittertools to sumulate some metadata:
from itertools import cycle, islice
import pandas as pd

# for demonstration we will simulate two types of metadata.  Metadata stored in a list and metadata stored with key-value pairs.
metadata_list = list(islice(cycle(['A', 'B', 'X', 'Y']), adat.shape[0]))
metadata_dictionary = {k:v for k, v in zip(adat.index.get_level_values('SampleId').to_list(), metadata_list)}

Add unlabeled metadata

You might do this if you know your metadata and Adat are ordered the same way but you are not using a shared key.

return to top

new_meta_adat = adat.insert_meta(0,'GroupData', metadata_list)
# this will produce a new Adat file with group data in the right most column of the index
new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'GroupData']]

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	PlateId	SampleId	GroupData
0	Example Adat Set001	1	A
1	Example Adat Set001	2	B
2	Example Adat Set001	3	X
3	Example Adat Set001	4	Y
4	Example Adat Set001	5	A
5	Example Adat Set001	6	B
6	Example Adat Set001	7	X

Add Keyed Metadata

You might have data coming in as key value pairs from another data source. In that case it is easier to insert metadata using those keys:

return to top

# The arguments are `axis` 0 for row metadata, 1 for column metadata, `name` the name of the new index,
#`key_meta_name` the nameo of the axis used to match the keys. `values_dict` the dictionary containing the new data
new_meta_adat = adat.insert_keyed_meta(0,'GroupData', 'SampleId', metadata_dictionary)
# this will produce a new Adat file with group data in the right most column of the index
new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'GroupData']]

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	PlateId	SampleId	GroupData
0	Example Adat Set001	1	A
1	Example Adat Set001	2	B
2	Example Adat Set001	3	X
3	Example Adat Set001	4	Y
4	Example Adat Set001	5	A
5	Example Adat Set001	6	B
6	Example Adat Set001	7	X

Replace Metadata with Unlabeled Metadata

You might do this if you know your metadata and Adat are ordered the same way but you are not using a shared key.

return to top

new_meta_adat = adat.replace_meta(0,'SampleName', metadata_list)
# this will produce a new Adat file with group data in the right most column of the index
new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'SampleName']]

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	PlateId	SampleId	SampleName
0	Example Adat Set001	1	A
1	Example Adat Set001	2	B
2	Example Adat Set001	3	X
3	Example Adat Set001	4	Y
4	Example Adat Set001	5	A
5	Example Adat Set001	6	B
6	Example Adat Set001	7	X

Replace Metadata with Keyed Metadata

You might need to replace metadata based on another document using key-value pairs.

return to top

#Here we replace the values in `SampleName` with the values from `metadata_dictionary`
new_meta_adat = adat.replace_keyed_meta(0,'SampleName', metadata_dictionary, 'SampleId')
# this will produce a new Adat file with group data in the right most column of the index
new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'SampleName']]

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	PlateId	SampleId	SampleName
0	Example Adat Set001	1	A
1	Example Adat Set001	2	B
2	Example Adat Set001	3	X
3	Example Adat Set001	4	Y
4	Example Adat Set001	5	A
5	Example Adat Set001	6	B
6	Example Adat Set001	7	X

Math

You may perform mathematical transformations on the feature data via apply or calling those functions and passing the entire dataframe.

import numpy as np
# Using the `adat` loaded from above
log10_adat = adat.apply(np.log10)  # equivalent to `np.log10(adat)`
rounded_adat = adat.apply(round, args=[5,])  # equivalent to `round(adat, 5)`
sqrt_adat = adat.apply(np.sqrt)  # equivlane to `np.sqrt(adat)`

Subsetting/Slicing the Dataframe

You may extract certain subgroups of samples and/or features. SomaData augments the pandas dataframe with a number of helper functions to aid the user.

return to top

# Extract rows of only calibrator-type samples
calibrator_adat = adat.pick_on_meta(axis=0, name='SampleType', values=['Calibrator'])

# Exclude calibrator-type samples
non_calibrator_adat = adat.exclude_on_meta(axis=0, name='SampleType', values=['Calibrator'])

# Extract columns containing features that start with 'MMP'
target_names = adat.columns.get_level_values('Target')
mmp_names = [target for target in target_names if target.startswith('MMP')]
mmp_adat = adat.pick_on_meta(axis=1, name='Target', values=mmp_names)
mmp_adat

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}

</style>

																																	SeqId	15419-15	2579-17	2788-55	2789-26	2838-53	4160-49	4496-60	4924-32	4925-54	5002-76	5268-49	6425-87	8479-4	9172-69	9719-145
																																	SeqIdVersion	3	5	1	2	1	1	2	1	2	1	3	3	3	3	3
																																	SomaId	SL012374	SL000527	SL000524	SL000525	SL003332	SL000124	SL000522	SL000521	SL000523	SL002646	SL003331	SL007616	SL000645	SL000526	SL003331
																																	TargetFullName	Matrix metalloproteinase-20	Matrix metalloproteinase-9	Stromelysin-1	Matrilysin	Matrix metalloproteinase-17	72 kDa type IV collagenase	Macrophage metalloelastase	Interstitial collagenase	Collagenase 3	Matrix metalloproteinase-14	Matrix metalloproteinase-16	Matrix metalloproteinase-19	Stromelysin-2	Neutrophil collagenase	Matrix metalloproteinase-16
																																	Target	MMP20	MMP-9	MMP-3	MMP-7	MMP-17	MMP-2	MMP-12	MMP-1	MMP-13	MMP-14	MMP-16	MMP19	MMP-10	MMP-8	MMP-16
																																	UniProt	O60882	P14780	P08254	P09237	Q9ULZ9	P08253	P39900	P03956	P45452	P50281	P51512	Q99542	P09238	P22894	P51512
																																	EntrezGeneID	9313	4318	4314	4316	4326	4313	4321	4312	4322	4323	4325	4327	4319	4317	4325
																																	EntrezGeneSymbol	MMP20	MMP9	MMP3	MMP7	MMP17	MMP2	MMP12	MMP1	MMP13	MMP14	MMP16	MMP19	MMP10	MMP8	MMP16
																																	Organism	Human	Human	Human	Human	Human	Human	Human	Human	Human	Human	Human	Human	Human	Human	Human
																																	Units	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU
																																	Type	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein
																																	Dilution	20	0.5	0.5	20	20	0.5	20	20	20	20	20	0.5	20	20	20
																																	PlateScale_Reference	937.5	19472.3	117.2	2392.9	1520.6	14888.5	1014.9	7611.5	376.6	632.1	565.8	5063.4	1534.0	1088.5	1166.8
																																	CalReference	937.5	19472.3	117.2	2392.9	1520.6	14888.5	1014.9	7611.5	376.6	632.1	565.8	5063.4	1534.0	1088.5	1166.8
																																	Cal_Example_Adat_Set001	1.06947296	1.01957222	0.98404702	0.90131455	1.13783298	0.98961103	0.96180819	0.91162239	0.98689727	1.02497162	0.97906212	0.97843478	0.95061040	0.94709823	1.02243253
																																	ColCheck	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS
																																	CalQcRatio_Example_Adat_Set001_170255	1.132	1.002	0.893	1.130	0.955	0.987	1.014	1.054	1.023	0.987	1.024	1.005	1.023	1.029	1.003
																																	QcReference_170255	793.4	10420.6	124.2	9482.5	1252.0	16044.7	1115.3	13192.8	394.4	492.1	559.2	6162.7	1946.6	845.5	1171.0
																																	Cal_Example_Adat_Set002	1.02380692	1.00117741	1.31243001	0.67812509	1.09317038	0.98499534	0.95953484	0.95154455	0.90594178	1.08143713	0.98143972	0.98821187	0.92700024	1.03983569	1.02055454
																																	CalQcRatio_Example_Adat_Set002_170255	1.192	1.009	0.821	1.123	1.083	1.028	1.006	1.025	0.991	0.949	1.002	0.990	1.010	1.039	0.951
PlateId	PlateRunDate	ScannerID	PlatePosition	SlideId	Subarray	SampleId	SampleType	PercentDilution	SampleMatrix	Barcode	Barcode2d	SampleName	SampleNotes	AliquotingNotes	SampleDescription	AssayNotes	TimePoint	ExtIdentifier	SsfExtId	SampleGroup	SiteId	TubeUniqueID	CLI	HybControlNormScale	RowCheck	NormScale_20	NormScale_0_005	NormScale_0_5	ANMLFractionUsed_20	ANMLFractionUsed_0_005	ANMLFractionUsed_0_5	Age	Sex
Example Adat Set001	2020-06-18	SG15214400	H9	258495800012	3	1	Sample	20	Plasma-PPT															0.98185998	PASS	1.03693580	0.85701624	0.77717491	0.914	0.869	0.903	76	F	729.9	16230.2	177.9	11903.3	1378.1	11997.2	2455.9	20988.1	442.5	414.2	712.5	8965.5	2030.6	2181.5	1096.4
			H8	258495800004	7	2	Sample	20	Plasma-PPT															0.96671829	PASS	0.96022505	0.84858420	0.85201953	0.937	0.956	0.973	55	F	873.3	17253.4	152.8	16508.8	1652.0	14574.6	1595.0	11498.5	501.1	505.9	629.9	8669.7	1301.6	1571.2	1149.0
			H7	258495800010	8	3	Sample	20	Plasma-PPT															1.00193072	PASS	0.98411617	1.03270156	0.91519153	0.907	0.919	0.915	47	M	993.0	13094.1	127.5	7577.0	1711.7	14468.7	503.3	14697.7	2883.2	445.8	510.6	6728.9	1067.3	1528.9	1027.8
			H6	258495800003	4	4	Sample	20	Plasma-PPT															0.94017961	PASS	1.07839878	0.94626841	0.91246731	0.934	0.919	0.912	37	M	906.5	20991.4	155.0	8673.7	1667.5	9913.1	438.4	20819.0	375.2	644.8	547.8	8629.0	1162.4	1173.7	1091.6
			H5	258495800009	4	5	Sample	20	Plasma-PPT															0.94621098	PASS	0.84679446	0.92904553	0.77413056	0.707	0.894	0.708	71	F	747.0	8070.4	124.0	20423.7	1426.6	11345.0	1954.5	16168.9	356.9	446.4	541.7	8125.2	1667.0	1048.6	1132.6
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Example Adat Set002	2020-07-20	SG15214400	A2	258495800108	3	188	Sample	20	Plasma-PPT															0.96699908	PASS	0.95993275	1.08910138	0.99491979	0.566	0.912	0.719	38	F	714.7	19877.3	114.5	15151.9	894.7	17266.2	682.8	27419.4	480.3	705.1	527.9	6638.2	3919.7	1787.5	1191.1
			A12	258495800104	2	189	Sample	20	Plasma-PPT															0.91482584	PASS	1.21880129	1.01022697	0.99244374	0.918	0.919	0.926	40	F	865.8	25801.4	123.6	15711.6	1308.4	16598.3	1369.2	15153.5	474.2	655.0	592.2	8953.9	2494.9	2156.2	1391.5
			A11	258495800108	5	190	Sample	20	Plasma-PPT															0.88282283	PASS	1.36699142	1.16271427	1.19673587	0.927	0.981	0.964	43	M	869.7	13728.5	140.9	11437.3	1353.5	17996.2	1344.2	10575.3	433.7	439.8	530.7	6193.9	2067.6	2466.6	1114.8
			A10	258495800105	5	191	Sample	20	Plasma-PPT															0.95792282	PASS	1.30590374	0.98395166	0.97460119	0.835	0.963	0.944	55	M	529.2	13298.2	161.7	14210.8	1026.0	14549.0	1466.1	9683.8	291.6	435.8	676.1	9584.8	1799.5	1347.8	1194.2
			A1	258495800110	5	192	Sample	20	Plasma-PPT															0.97384118	PASS	1.30710646	0.93230123	1.00804341	0.793	0.963	0.933	56	F	1934.4	7567.8	133.0	13614.3	1220.2	16223.8	1110.9	7737.4	364.2	3407.0	645.2	7867.1	1720.9	1888.2	1293.8

192 rows × 15 columns

Lifting ADAT data between assay versions.

Adat data can be lifted from one SomaScan Assay version's RFU space to another SomaScan Assay version's RFU space. This is achieved by scaling SOMAmer reagent measurement columns using scale factors available at menu.somalogic.com and built in to this tool in the Adat.lift() method.

The example Adat exists in SomaScan Version V4.0 assay space (also called 5K in some literature). In this example we will lift to SomaScan V5.0 (11K) assay space. It is important to know that only SOMAmer reagent measurements in both assay versions can be lifted. When lifting from a smaller to a larger plex (as demonstrated) the resulting Adat will remain in the smaller plex size. When lifting from a larger to smaller plex size reagents that don't exist in the small plex size will be scaled by 1.0. The end user might choose to redact the lifted Adat to the smaller plex to better merge data.

The tool will raise in error if the end user attempts to lift an Adat object to its current version or an unsupported assay version.

Assay version mapping:

SomaScan data can be referred to by the assay version i.e. 'V5.0' or by the plex size i.e. '11K'. The tool can use either 'V5.0' or '11K' interchangeably in it's input. The mapping between these terms is shown in the table below:

SomaScan Assay Version	SomaScan Plex	SomaScan Menu Size
V4	5K	5284
V4.1	7K	7596
V5.0	11K	11083

return to top

lifted_adat = adat.lift('V5.0')

lifted_adat

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}

</style>

																																	SeqId	10000-28	10001-7	10003-15	10006-25	10008-43	10011-65	10012-5	10013-34	10014-31	10015-119	...	9981-18	9983-97	9984-12	9986-14	9989-12	9993-11	9994-217	9995-6	9997-12	9999-1
																																	SeqIdVersion	3	3	3	3	3	3	3	3	3	3	...	3	3	3	3	3	3	3	3	3	3
																																	SomaId	SL019233	SL002564	SL019245	SL019228	SL019234	SL019246	SL014669	SL025418	SL007803	SL014924	...	SL018293	SL019202	SL019205	SL005356	SL019194	SL019212	SL019217	SL013164	SL019215	SL019231
																																	TargetFullName	Beta-crystallin B2	RAF proto-oncogene serine/threonine-protein kinase	Zinc finger protein 41	ETS domain-containing protein Elk-1	Guanylyl cyclase-activating protein 1	Inositol polyphosphate 5-phosphatase OCRL-1	SAM pointed domain-containing Ets transcription factor	Fc_MOUSE	Zinc finger protein SNAI2	Voltage-gated potassium channel subunit beta-2	...	Protein FAM234B	Inactive serine protease 35	Protein YIPF6	Neuropeptide W	Leucine-rich repeat-containing protein 24	Zinc finger protein 264	Potassium-transporting ATPase subunit beta	Deoxyuridine 5'-triphosphate nucleotidohydrolase, mitochondrial	UBX domain-containing protein 4	Interferon regulatory factor 6
																																	Target	CRBB2	c-Raf	ZNF41	ELK1	GUC1A	OCRL	SPDEF	Fc_MOUSE	SLUG	KCAB2	...	K1467	PRS35	YIPF6	Neuropeptide W	LRC24	ZN264	ATP4B	DUT	UBXN4	IRF6
																																	UniProt	P43320	P04049	P51814	P19419	P43080	Q01968	O95238	Q99LC4	O43623	Q13303	...	A2RU67	Q8N3Z0	Q96EC8	Q8N729	Q50LG9	O43296	P51164	P33316	Q92575	O14896
																																	EntrezGeneID	1415	5894	7592	2002	2978	4952	25803		6591	8514	...	57613	167681	286451	283869	441381	9422	496	1854	23190	3664
																																	EntrezGeneSymbol	CRYBB2	RAF1	ZNF41	ELK1	GUCA1A	OCRL	SPDEF		SNAI2	KCNAB2	...	KIAA1467	PRSS35	YIPF6	NPW	LRRC24	ZNF264	ATP4B	DUT	UBXN4	IRF6
																																	Organism	Human	Human	Human	Human	Human	Human	Human	Mouse	Human	Human	...	Human	Human	Human	Human	Human	Human	Human	Human	Human	Human
																																	Units	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	...	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU	RFU
																																	Type	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	...	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein	Protein
																																	Dilution	20	20	0.5	20	20	20	20	20	20	20	...	20	20	20	20	20	20	20	20	20	20
																																	PlateScale_Reference	687.4	227.8	126.9	634.2	585.0	2807.1	1623.3	499.6	857.2	443.3	...	643.9	430.0	627.5	3644.5	449.4	953.3	1971.1	1275.6	4426.9	851.9
																																	CalReference	687.4	227.8	126.9	634.2	585.0	2807.1	1623.3	499.6	857.2	443.3	...	643.9	430.0	627.5	3644.5	449.4	953.3	1971.1	1275.6	4426.9	851.9
																																	Cal_Example_Adat_Set001	1.01252025	1.01605709	0.95056180	0.99607350	0.94051447	1.05383489	1.17290462	1.07095391	1.03464092	1.07466667	...	0.98035932	1.04878049	1.03513692	0.96341431	1.01444695	1.04551437	0.98299422	0.97426106	0.96896272	0.96042841
																																	ColCheck	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	...	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS
																																	CalQcRatio_Example_Adat_Set001_170255	1.008	0.970	1.046	1.042	1.036	0.975	1.010	0.953	0.978	0.975	...	0.982	0.949	1.003	0.938	1.017	0.998	1.071	0.985	0.960	0.974
																																	QcReference_170255	505.4	223.9	119.6	667.2	587.5	2617.6	1340.6	443.0	1289.4	441.5	...	700.7	393.2	612.6	3089.2	455.1	885.6	1389.7	950.9	5560.7	1033.6
																																	Cal_Example_Adat_Set002	1.01476233	1.03686846	1.15258856	0.93581231	0.96201283	1.03133955	1.21250373	1.18192572	0.98926717	1.13173347	...	0.96075798	1.15250603	1.12013567	1.08296437	0.99314917	1.08268030	1.02784586	0.97351752	0.94828953	0.92900763
																																	CalQcRatio_Example_Adat_Set002_170255	1.067	1.007	0.981	1.026	0.998	1.013	1.078	0.996	0.971	0.941	...	0.982	0.993	0.990	0.929	0.978	0.961	1.022	0.970	1.027	0.997
PlateId	PlateRunDate	ScannerID	PlatePosition	SlideId	Subarray	SampleId	SampleType	PercentDilution	SampleMatrix	Barcode	Barcode2d	SampleName	SampleNotes	AliquotingNotes	SampleDescription	AssayNotes	TimePoint	ExtIdentifier	SsfExtId	SampleGroup	SiteId	TubeUniqueID	CLI	HybControlNormScale	RowCheck	NormScale_20	NormScale_0_005	NormScale_0_5	ANMLFractionUsed_20	ANMLFractionUsed_0_005	ANMLFractionUsed_0_5	Age	Sex
Example Adat Set001	2020-06-18	SG15214400	H9	258495800012	3	1	Sample	20	Plasma-PPT															0.98185998	PASS	1.03693580	0.85701624	0.77717491	0.914	0.869	0.903	76	F	386.0	309.5	97.6	449.1	396.6	4965.9	1106.7	274.9	786.3	567.2	...	551.9	352.5	408.4	3027.5	538.8	686.3	5202.4	2188.4	12697.7	966.5
			H8	258495800004	7	2	Sample	20	Plasma-PPT															0.96671829	PASS	0.96022505	0.84858420	0.85201953	0.937	0.956	0.973	55	F	384.3	292.9	99.1	418.6	382.6	2149.6	1307.8	324.1	779.4	371.8	...	689.3	358.2	456.0	5724.5	470.3	663.0	1195.9	2302.7	13247.8	824.2
			H7	258495800010	8	3	Sample	20	Plasma-PPT															1.00193072	PASS	0.98411617	1.03270156	0.91519153	0.907	0.919	0.915	47	M	336.6	299.0	2948.3	420.1	299.3	2306.6	1290.9	348.6	845.0	416.8	...	547.0	382.0	503.9	3380.7	405.3	647.6	1552.4	1183.2	11072.1	1145.8
			H6	258495800003	4	4	Sample	20	Plasma-PPT															0.94017961	PASS	1.07839878	0.94626841	0.91246731	0.934	0.919	0.912	37	M	358.5	247.4	109.9	420.0	331.7	2261.4	1184.1	362.0	4348.0	374.6	...	561.9	384.4	477.7	1361.8	493.0	643.5	1005.3	1399.4	9082.4	804.6
			H5	258495800009	4	5	Sample	20	Plasma-PPT															0.94621098	PASS	0.84679446	0.92904553	0.77413056	0.707	0.894	0.708	71	F	377.2	709.3	93.3	589.3	313.1	1949.4	990.0	272.8	787.7	723.8	...	480.0	343.0	742.3	4846.0	487.7	701.4	1132.4	9852.9	38461.1	2865.9
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Example Adat Set002	2020-07-20	SG15214400	A2	258495800108	3	188	Sample	20	Plasma-PPT															0.96699908	PASS	0.95993275	1.08910138	0.99491979	0.566	0.912	0.719	38	F	393.3	954.0	124.2	719.8	1284.4	1312.0	750.7	312.1	727.7	829.2	...	454.3	301.8	531.0	1658.5	541.8	861.5	1352.3	11604.6	45580.6	3687.1
			A12	258495800104	2	189	Sample	20	Plasma-PPT															0.91482584	PASS	1.21880129	1.01022697	0.99244374	0.918	0.919	0.926	40	F	337.4	281.6	82.5	625.5	26153.5	1856.4	1004.8	297.4	734.0	388.3	...	636.7	292.7	410.7	9236.8	487.6	629.3	1910.9	1855.0	9778.6	1004.4
			A11	258495800108	5	190	Sample	20	Plasma-PPT															0.88282283	PASS	1.36699142	1.16271427	1.19673587	0.927	0.981	0.964	43	M	372.9	270.8	204.2	472.6	446.1	1733.8	1067.7	354.3	745.7	410.7	...	566.0	364.5	448.2	2597.7	515.6	621.4	1113.3	1302.7	8766.2	770.8
			A10	258495800105	5	191	Sample	20	Plasma-PPT															0.95792282	PASS	1.30590374	0.98395166	0.97460119	0.835	0.963	0.944	55	M	320.4	319.1	105.9	527.1	370.8	1701.9	756.9	266.4	618.4	453.9	...	536.3	309.1	434.0	5167.2	522.8	588.2	891.1	2466.6	15455.9	1190.4
			A1	258495800110	5	192	Sample	20	Plasma-PPT															0.97384118	PASS	1.30710646	0.93230123	1.00804341	0.793	0.963	0.933	56	F	370.3	288.1	187.2	469.9	438.5	1777.9	787.3	249.4	930.8	423.7	...	601.9	285.9	467.0	2564.2	590.0	673.2	1036.7	2142.1	7950.7	722.8

192 rows × 5284 columns

# because some common documentation refers to plex size intead of assay version the tool also supports lifing by naming plex size.
lifted_adat = adat.lift('11K')

# Observing the Lin's CCC of the lifting scale factors.
from somadata.data import getSomaScanLiftCCC

# the method returns a pandas dataframe containing the available lift Lins's CCC values:
ccc = getSomaScanLiftCCC()

ccc

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Serum Lin's CCC v5.0 11K to v4.1 7K	Plasma Lin's CCC v5.0 11K to v4.1 7K	Serum Lin's CCC v5.0 11K to v4.0 5K	Plasma Lin's CCC v5.0 11K to v4.0 5K	Serum Lin's CCC v4.1 7K to v4.0 5K	Plasma Lin's CCC v4.1 7K to v4.0 5K
10000-28	0.977	0.982	0.970	0.966	0.967	0.963
10001-7	0.857	0.961	0.819	0.860	0.875	0.875
10003-15	0.759	0.787	0.761	0.674	0.774	0.668
10006-25	0.937	0.927	0.903	0.864	0.937	0.877
10008-43	0.951	0.939	0.915	0.879	0.925	0.908
...	...	...	...	...	...	...
9993-11	0.823	0.855	0.704	0.753	0.714	0.711
9994-217	0.492	0.964	0.502	0.767	0.809	0.778
9995-6	0.975	0.976	0.965	0.916	0.983	0.922
9997-12	0.877	0.955	0.857	0.892	0.926	0.885
9999-1	0.909	0.962	0.870	0.883	0.944	0.898

11083 rows × 6 columns

Lin's CCC Between Lifted and Assay Space Native Data

The tool allows you to display Lin's concordance correlation coefficient (Lin 1989) derived during the calculation of the lifting scale factors. This metric allows you to see how well lifted data is expected to correlate with date collected originally in the target assay data signal space. A Lin's CCC close to 1.0 indicates strong correlation indicating the signal would be highly concordant with the lifted value if the sample data were collected in the target assay version space.

# your exact transformation's Lin's CCC can be selected by filtering the column that contains your matrix and versions
# Lin's CCC are symetrical v5.0 -> v4.0 == v4.0 -> v5.0.
ccc["Plasma Lin's CCC v5.0 11K to v4.0 5K"]

10000-28    0.966
10001-7     0.860
10003-15    0.674
10006-25    0.864
10008-43    0.879
            ...
9993-11     0.753
9994-217    0.767
9995-6      0.916
9997-12     0.892
9999-1      0.883
Name: Plasma Lin's CCC v5.0 11K to v4.0 5K, Length: 11083, dtype: float64

Writing an `ADAT` file

In order to store or share analysis the user may need to write out an ADAT file. This utility supports writing to the file system.

return to top

adat.to_adat('/tmp/out_file.adat')

Typical Analyses

Although it is beyond the scope of the SomaData package, below are 3 sample analyses that typical users/clients would perform on SomaLogic data. They are not intended to be a definitive guide in statistical analysis and existing packages do exist in the Python universe that perform parts or extensions of these techniques. Many variations of the workflows below exist, however the framework highlights how one could perform standard preliminary analyses on SomaLogic data for:

Two-group differential expression (t-test)
Binary classification (logistic regression)
Linear regression

return to top

Compare Groups (M/F) via t-test

from somadata.data.example_data import example_data # Example ADAT included with SomaData
from scipy.stats import ttest_ind
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO

Display the shape of the adat (rows, columns)

example_data.shape

(192, 5284)

Describe the sample types within the adat and display their counts

Counter(example_data.index.get_level_values('SampleType'))

Counter({'Sample': 170, 'Calibrator': 10, 'Buffer': 6, 'QC': 6})

Prepare the adat for analysis

filtered_transformed_data = (
    example_data
        .exclude_on_meta(axis=0, name='Sex', values=[''])            # rm NAs if present
        .pick_on_meta(axis=0, name='SampleType', values=['Sample'])  # rm control samples
        .apply(np.log10)                                             # log10-transform
)

clean_data = (
    filtered_transformed_data
        .insert_keyed_meta(                                          # map Sex -> 0/1
            axis=0,
            key_meta_name='Sex',
            inserted_meta_name='Group',
            values_dict={'M': 1, 'F': 0}
        )
        .apply(lambda x: x - x.mean(), axis=0)                       # center features
        .apply(lambda x: x / x.std(), axis=0)                        # scale features
)

Display the grouping counts

print(clean_data.index.to_frame()['Sex'].value_counts())
print(clean_data.index.to_frame()['Group'].value_counts())

Sex
F    85
M    85
Name: count, dtype: int64
Group
0    85
1    85
Name: count, dtype: int64

Split the adat based on `Group` and perform t-test across all aptamers

tt_g0 = clean_data.pick_on_meta(axis=0, name='Group', values=[0])
tt_g1 = clean_data.pick_on_meta(axis=0, name='Group', values=[1])

tt_res = ttest_ind(tt_g0, tt_g1)
t_tests = list(zip(clean_data.columns.get_level_values('TargetFullName'), tt_res.pvalue))

Sort the results and display the 12 aptamers with the most significant p-values

t_tests_sorted = sorted(t_tests, key=lambda x: x[1])
tt_top_12_analytes = [name for name, p_value in t_tests_sorted[:12]]
tt_top_12_analytes

['Prostate-specific antigen',
 'Pregnancy zone protein',
 'Kunitz-type protease inhibitor 3',
 'Follicle stimulating hormone',
 'Ectonucleotide pyrophosphatase/phosphodiesterase family member 2',
 'Beta-defensin 104',
 'Luteinizing hormone',
 'Cysteine-rich secretory protein 2',
 'Human Chorionic Gonadotropin',
 'Serum amyloid P-component',
 'SLIT and NTRK-like protein 4',
 'Neurotrimin']

Plot the `Group` log(RFU) for each aptamer

tt_df= (
    filtered_transformed_data
        .pick_meta(axis=1, names=['TargetFullName'])
        .pick_on_meta(axis=1, name='TargetFullName', values=tt_top_12_analytes)[tt_top_12_analytes]
        .reset_index()
)

tt_melted_df = pd.melt(tt_df, value_vars=tt_top_12_analytes, id_vars='Sex', value_name='log10(RFU)')

tt_p = sns.catplot(
    x='Sex',
    y='log10(RFU)',
    col='TargetFullName',
    data=tt_melted_df,
    kind='box',
    col_wrap=3,
    sharey=False
)
tt_p.set_titles(row_template='{row_name}', col_template='{col_name}')
plt.show()

Logistic Regression (Predict Sex)

# Import the libraries that we need for this analysis
from sklearn.model_selection import train_test_split
from sklearn import metrics
from scipy.stats import pearsonr
import statsmodels.api as sm
from IPython.display import HTML

Prepare the data for LogisticRegression

# Wrangle `clean_data` into a simpler form
logr_x_df = (
    clean_data
        .pick_meta(axis=1, names=['SeqId', 'TargetFullName'])
        .reset_index(drop=True)
)
logr_y_df = (
    clean_data.index.get_level_values('Group')
)

# Split the dataset into train and test, holding back 25 samples for testing
logr_x_train, logr_x_test, logr_y_train, logr_y_test = train_test_split(logr_x_df, logr_y_df, test_size=25, random_state=0)

Perform univariate logistic regression for each aptamer

logr_apt_perf = []
for seq_info in logr_x_train:
    x = sm.add_constant(logr_x_train[seq_info]) # Need to add the intercept term since sm.GLM does not automatically do it
    mod = sm.GLM(logr_y_train, x, family=sm.families.Binomial())
    res = mod.fit()
    logr_apt_perf.append(res.summary2().tables[1].loc[[seq_info]])

Wrangle the GLM results of each aptamer into a dataframe and sort them by p-value

logr_df = pd.concat(logr_apt_perf).reset_index()
logr_df['SeqId'] = [x[0] for x in logr_df['index']]
logr_df['TargetFullName'] = [x[1] for x in logr_df['index']]
logr_df = logr_df.drop('index', axis=1)
logr_df = logr_df[['SeqId', 'TargetFullName', 'Coef.', 'Std.Err.', 'z', 'P>|z|', '[0.025', '0.975]']].set_index('SeqId')
logr_df_sorted = logr_df.sort_values('P>|z|')
HTML(logr_df_sorted.head(20).to_html()) # Need to use HTML here to display nicely for this README

	TargetFullName	Coef.	Std.Err.	z	P>\|z\|	[0.025	0.975]
SeqId
6580-29	Pregnancy zone protein	-3.079818	0.489558	-6.291020	3.153866e-10	-4.039334	-2.120302
5763-67	Beta-defensin 104	2.974778	0.478400	6.218181	5.029509e-10	2.037131	3.912425
3032-11	Follicle stimulating hormone	-1.505718	0.250398	-6.013292	1.817935e-09	-1.996490	-1.014946
7926-13	Kunitz-type protease inhibitor 3	2.887475	0.482526	5.984087	2.176067e-09	1.941742	3.833208
16892-23	Ectonucleotide pyrophosphatase/phosphodiesterase family member 2	-2.335113	0.396641	-5.887216	3.927542e-09	-3.112516	-1.557710
9282-12	Cysteine-rich secretory protein 2	1.768026	0.309050	5.720837	1.060006e-08	1.162299	2.373754
2953-31	Luteinizing hormone	-1.319728	0.240323	-5.491466	3.986115e-08	-1.790753	-0.848702
4914-10	Human Chorionic Gonadotropin	-1.244551	0.229781	-5.416240	6.086534e-08	-1.694914	-0.794188
8468-19	Prostate-specific antigen	5.841131	1.113030	5.247953	1.537986e-07	3.659632	8.022630
2474-54	Serum amyloid P-component	1.434929	0.279218	5.139108	2.760458e-07	0.887673	1.982185
8428-102	Neurotrimin	-1.264317	0.246142	-5.136543	2.798380e-07	-1.746745	-0.781888
9002-36	Serpin A11	-1.087385	0.219035	-4.964434	6.890169e-07	-1.516685	-0.658084
3066-12	Galectin-3	-1.005615	0.206735	-4.864276	1.148764e-06	-1.410808	-0.600423
5116-62	Roundabout homolog 2	-1.291594	0.270447	-4.775767	1.790237e-06	-1.821661	-0.761527
7139-14	SLIT and NTRK-like protein 4	1.018520	0.218625	4.658761	3.181183e-06	0.590023	1.447016
8484-24	Leptin	-0.991585	0.219260	-4.522415	6.113800e-06	-1.421326	-0.561843
5934-1	Ferritin	1.012300	0.227525	4.449188	8.619536e-06	0.566360	1.458240
15324-58	Ferritin light chain	1.002813	0.226429	4.428822	9.474919e-06	0.559021	1.446606
4234-8	Interleukin-1 receptor-like 1	1.134058	0.258632	4.384831	1.160761e-05	0.627149	1.640968
2696-87	Persephin	1.412833	0.323348	4.369389	1.245947e-05	0.779083	2.046584

Fit model

logr_top_analytes = [(index, row['TargetFullName']) for index, row in logr_df_sorted.head(5).iterrows()] # Select the top 5 aptamers based on p-value
x = sm.add_constant(logr_x_train[logr_top_analytes])
logr_mod = sm.GLM(logr_y_train, x, family=sm.families.Binomial())
logr_res = logr_mod.fit()
logr_res.summary()

Generalized Linear Model Regression Results

Dep. Variable:	y	No. Observations:	145
Model:	GLM	Df Residuals:	139
Model Family:	Binomial	Df Model:	5
Link Function:	Logit	Scale:	1.0000
Method:	IRLS	Log-Likelihood:	-8.4167
Date:	Fri, 01 Mar 2024	Deviance:	16.833
Time:	13:19:34	Pearson chi2:	17.8
No. Iterations:	10	Pseudo R-squ. (CS):	0.7186
Covariance Type:	nonrobust

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	1.6106	1.178	1.367	0.172	-0.699	3.920
('6580-29', 'Pregnancy zone protein')	-7.0008	3.112	-2.250	0.024	-13.099	-0.902
('5763-67', 'Beta-defensin 104')	2.7821	1.255	2.217	0.027	0.322	5.242
('3032-11', 'Follicle stimulating hormone')	-1.1722	0.818	-1.432	0.152	-2.776	0.432
('7926-13', 'Kunitz-type protease inhibitor 3')	2.2901	1.053	2.174	0.030	0.226	4.354
('16892-23', 'Ectonucleotide pyrophosphatase/phosphodiesterase family member 2')	-3.6045	1.498	-2.407	0.016	-6.540	-0.669

# Create confusion matrix
x = sm.add_constant(logr_x_test[logr_top_analytes])
logr_predictions = [1 if val > 0.5 else 0 for val in logr_res.predict(x)]
cm = metrics.confusion_matrix(logr_y_test.values, logr_predictions)

# Print out performance metrics via Pandas
tp = cm[1, 1]
tn = cm[0, 0]
fp = cm[0, 1]
fn = cm[1, 0]
logr_perf_df = pd.DataFrame.from_records({
    'Sensitivity': tp / (tp + fn),
    'Specificity': tn / (tn + fp),
    'Accuracy': (tp + tn) / sum(sum(cm)),
    'PPV': tp / (tp + fp),
    'NPV': tn / (tn + fn)
}, index=['Value'])
HTML(logr_perf_df.to_html())

	Accuracy	NPV	PPV	Sensitivity	Specificity
Value	1.0	1.0	1.0	1.0	1.0

# Display the confusion matrix
plt.figure(figsize=(3,3))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square=True, cmap='Blues')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Accuracy Score: {0}'.format(100 * logr_perf_df['Accuracy'].values[0])
plt.title(all_sample_title)
plt.show()

Linear Regression (Predict Age)

We use the same clean_data as the logistic regression analysis above.

Wrangle data

# Wrangle `clean_data` into a simpler form
linr_x_df = (
    clean_data
        .pick_meta(axis=1, names=['SeqId', 'TargetFullName'])
        .reset_index(drop=True)
)
linr_y = (
    [float(age) for age in clean_data.index.get_level_values('Age')]
)

# Split the dataset into train and test, holding back 25 samples for testing
linr_x_train, linr_x_test, linr_y_train, linr_y_test = train_test_split(linr_x_df, linr_y, test_size=25, random_state=5)

Perform univariate linear regression for each aptamer

linr_apt_perf = []
for seq_info in linr_x_df:
    x = sm.add_constant(linr_x_train[seq_info])
    mod = sm.OLS(linr_y_train, x)
    res = mod.fit()
    linr_apt_perf.append(res.summary2().tables[1].loc[[seq_info]])

Wrangle the GLM results of each aptamer into a dataframe and sort them by p-value

linr_res_df = pd.concat(linr_apt_perf).reset_index()
linr_res_df['SeqId'] = [x[0] for x in linr_res_df['index']]
linr_res_df['TargetFullName'] = [x[1] for x in linr_res_df['index']]
linr_res_df = linr_res_df.drop('index', axis=1)
linr_res_df = linr_res_df[['SeqId', 'TargetFullName', 'Coef.', 'Std.Err.', 't', 'P>|t|', '[0.025', '0.975]']].set_index('SeqId')
linr_sorted_res_df = linr_res_df.sort_values('P>|t|')
HTML(linr_sorted_res_df.head(20).to_html())

	TargetFullName	Coef.	Std.Err.	t	P>\|t\|	[0.025	0.975]
SeqId
3045-72	Pleiotrophin	6.713339	0.865578	7.755906	1.506400e-12	5.002359	8.424320
4374-45	Growth/differentiation factor 15	6.766537	0.902926	7.494011	6.377086e-12	4.981730	8.551343
3024-18	Alpha-2-antiplasmin	-6.258739	0.895850	-6.986373	9.854830e-11	-8.029558	-4.487920
6392-7	WNT1-inducible-signaling pathway protein 2	6.206203	0.895426	6.931007	1.321588e-10	4.436222	7.976185
8480-29	EGF-containing fibulin-like extracellular matrix protein 1	6.179473	0.900370	6.863260	1.889770e-10	4.399719	7.959227
15640-54	Transgelin	6.159769	0.905043	6.806048	2.552783e-10	4.370777	7.948761
15533-97	Macrophage scavenger receptor types I and II	5.986741	0.907615	6.596127	7.616175e-10	4.192666	7.780815
15386-7	Fatty acid-binding protein, adipocyte	6.130562	0.931954	6.578182	8.355679e-10	4.288376	7.972748
16818-200	CUB domain-containing protein 1	5.919909	0.902842	6.556970	9.321408e-10	4.135268	7.704550
4496-60	Macrophage metalloelastase	6.149946	0.940133	6.541570	1.009072e-09	4.291592	8.008299
3362-61	Chordin-like protein 1	5.765444	0.913703	6.309975	3.287540e-09	3.959334	7.571554
4541-49	Cell adhesion molecule-related/down-regulated by oncogenes	-5.703166	0.906248	-6.293164	3.578780e-09	-7.494540	-3.911793
3600-2	Chitotriosidase-1	5.831590	0.951184	6.130871	8.071212e-09	3.951391	7.711789
2609-59	Cystatin-C	5.577894	0.934072	5.971588	1.773159e-08	3.731521	7.424267
3234-23	Coiled-coil domain-containing protein 80	5.647487	0.948795	5.952270	1.949244e-08	3.772010	7.522963
14133-93	Interleukin-1 receptor type 2	-5.489368	0.926319	-5.926004	2.216438e-08	-7.320415	-3.658321
19601-15	Ankyrin repeat and SOCS box protein 9	5.412074	0.930313	5.817474	3.755863e-08	3.573131	7.251018
9793-145	Immunoglobulin superfamily DCC subclass member 4	-5.292703	0.911239	-5.808247	3.927116e-08	-7.093942	-3.491463
2677-1	Epidermal growth factor receptor	-5.341396	0.919656	-5.808039	3.931061e-08	-7.159272	-3.523520
4968-50	Macrophage-capping protein	5.345710	0.926458	5.770050	4.721157e-08	3.514387	7.177033

Feed top 8 SOMAmers into statsmodels OLS regression

linr_top_analytes = [(index, row['TargetFullName']) for index, row in linr_sorted_res_df.head(8).iterrows()]
x = sm.add_constant(linr_x_train[linr_top_analytes])
mod = sm.OLS(linr_y_train, x).fit()
mod.summary()

OLS Regression Results

Dep. Variable:	y	R-squared:	0.501
Model:	OLS	Adj. R-squared:	0.471
Method:	Least Squares	F-statistic:	17.05
Date:	Fri, 01 Mar 2024	Prob (F-statistic):	2.29e-17
Time:	13:20:02	Log-Likelihood:	-522.29
No. Observations:	145	AIC:	1063.
Df Residuals:	136	BIC:	1089.
Df Model:	8
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	55.5436	0.765	72.602	0.000	54.031	57.057
('3045-72', 'Pleiotrophin')	1.6913	1.197	1.413	0.160	-0.676	4.059
('4374-45', 'Growth/differentiation factor 15')	1.2404	1.258	0.986	0.326	-1.247	3.728
('3024-18', 'Alpha-2-antiplasmin')	-2.5113	0.910	-2.758	0.007	-4.312	-0.711
('6392-7', 'WNT1-inducible-signaling pathway protein 2')	1.5143	0.997	1.519	0.131	-0.457	3.486
('8480-29', 'EGF-containing fibulin-like extracellular matrix protein 1')	2.1363	0.972	2.197	0.030	0.214	4.059
('15640-54', 'Transgelin')	1.2006	1.010	1.189	0.237	-0.796	3.198
('15533-97', 'Macrophage scavenger receptor types I and II')	0.8792	1.223	0.719	0.474	-1.540	3.298
('15386-7', 'Fatty acid-binding protein, adipocyte')	1.1453	1.180	0.971	0.333	-1.188	3.479

Omnibus:	2.712	Durbin-Watson:	2.042
Prob(Omnibus):	0.258	Jarque-Bera (JB):	2.501
Skew:	-0.322	Prob(JB):	0.286
Kurtosis:	3.008	Cond. No.	4.53

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Compute predictions on test set

x = sm.add_constant(linr_x_test[linr_top_analytes])
linr_predictions = mod.predict(x)
linr_pred_df = pd.DataFrame({
    'Actual Age': linr_y_test,
    'Predicted Age': linr_predictions
})

linr_pred_df['Pred Error'] = linr_pred_df['Predicted Age'] - linr_pred_df['Actual Age']

Compute model performance

# Lin's Concordance Correl. Coef.
# Accounts for location + scale shifts
def linCCC(x, y):
    if len(x) != len(y):
        raise Exception('Arrays are not the same length!')
    a = 2 * pearsonr(x, y)[0] * np.std(x, ddof=1) * np.std(y, ddof=1)
    b = np.var(x, ddof=1) + np.var(y, ddof=1) + (np.mean(x) - np.mean(y))**2
    return a / b

n = linr_x_test.shape[0]
p = len(linr_top_analytes)

# Regression metrics
linr_metrics_df = pd.DataFrame({
    'rss': linr_pred_df['Pred Error'].apply(lambda x: x**2).sum(), # residual sum of squares
    'tss': sum((linr_pred_df['Actual Age'] - linr_pred_df['Actual Age'].mean()) ** 2), # total sum of squares
    'R2': pearsonr(linr_pred_df['Actual Age'], linr_pred_df['Predicted Age'])[0] ** 2, # R-squared Pearson approx.
    'MAE': np.mean(np.abs(linr_pred_df['Pred Error'])), # Mean absolute error
    'RMSE': np.sqrt(np.mean(linr_pred_df['Pred Error'] ** 2)), # Root mean squared error
    'CCC': linCCC(linr_predictions, linr_y_test) # Lins concordance correlation coefficient
}, index=['Value'])

linr_metrics_df['rsq'] = 1 - (linr_metrics_df['rss'] / linr_metrics_df['tss']) # R-squared
linr_metrics_df['rsqadj'] = max(0, 1 - (1 - linr_metrics_df['rsq'][0]) * (n - 1) / (n - p - 1)), # Adjusted R-squared

HTML(linr_metrics_df.to_html())

	rss	tss	R2	MAE	RMSE	CCC	rsq	rsqadj
Value	989.768231	2771.84	0.674484	5.214434	6.292116	0.752326	0.64292	0.46438

Visualize performance via concordance plot of predicted and actual values

f, ax = plt.subplots(1, figsize=(5, 5), dpi=150)
plot_range = [linr_pred_df[['Actual Age', 'Predicted Age']].min().min() * 0.95, linr_pred_df[['Actual Age', 'Predicted Age']].max().max() * 1.05]
ax.plot(plot_range, plot_range, c='g')
ax.scatter(linr_pred_df['Actual Age'], linr_pred_df['Predicted Age'], alpha=0.5)
ax.set(
    xlim=plot_range,
    xlabel='Actual Age',
    ylim=plot_range,
    ylabel='Predicted Age',
    title='Concordance in Predicted vs. Actual Age'
)
plt.show()

Closing Remarks

Many variants of above possible.
Goal to provide general framework to handle SomaLogic data.
Not definitive guide in statistical theory, etc.

MIT LICENSE

See LICENSE
The MIT License:
- https://choosealicense.com/licenses/mit/
- https://tldrlegal.com/license/mit-license/

return to top

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
.github/workflows		.github/workflows
README_files		README_files
bin		bin
docs		docs
somadata		somadata
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.ipynb		README.ipynb
README.md		README.md
pyproject.toml		pyproject.toml

License

SomaLogic/Canopy

Folders and files

Latest commit

History

Repository files navigation

The Python SomaData Package from Somalogic, Inc.

Overview

Table of Contents:

Installation

Dependencies

Basics

For a traversable index of the library:

Internal Objects

Main Features (I/O)

Loading an ADAT

Wrangling

Accessing feature data

Display features

Get # Features

Clinical Data

Modifying Metadata

Add Row Metadata

Add unlabeled metadata

Add Keyed Metadata

Replace Metadata with Unlabeled Metadata

Replace Metadata with Keyed Metadata

Math

Subsetting/Slicing the Dataframe

Lifting ADAT data between assay versions.

Assay version mapping:

Lin's CCC Between Lifted and Assay Space Native Data

Writing an ADAT file

Typical Analyses

Compare Groups (M/F) via t-test

Display the shape of the adat (rows, columns)

Describe the sample types within the adat and display their counts

Prepare the adat for analysis

Display the grouping counts

Split the adat based on Group and perform t-test across all aptamers

Sort the results and display the 12 aptamers with the most significant p-values

Plot the Group log(RFU) for each aptamer

Logistic Regression (Predict Sex)

Prepare the data for LogisticRegression

Perform univariate logistic regression for each aptamer

Wrangle the GLM results of each aptamer into a dataframe and sort them by p-value

Fit model

Linear Regression (Predict Age)

Wrangle data

Perform univariate linear regression for each aptamer

Wrangle the GLM results of each aptamer into a dataframe and sort them by p-value

Feed top 8 SOMAmers into statsmodels OLS regression

Compute predictions on test set

Compute model performance

Visualize performance via concordance plot of predicted and actual values

Closing Remarks

MIT LICENSE

About

Resources

License

Stars

Watchers

Forks

Languages

The Python `SomaData` Package from Somalogic, Inc.

Writing an `ADAT` file

Split the adat based on `Group` and perform t-test across all aptamers

Plot the `Group` log(RFU) for each aptamer