Thunder

Large-scale neural data analysis with Spark

About

Spark is a powerful new framework for cluster computing, particularly well suited to iterative computations. Thunder is a family of analyses for finding structure in neural data using machine learning algorithms. It's fast to run, easy to develop for, and can be run interactively.

Thunder includes low-level utilities for data loading, saving, signal processing, and shared algorithms (regression, factorization, etc.), and high-level functions that can be scripted to easily combine analyses. The entire package is written in Spark's Python API (PySpark), making use of SciPy and NumPy. We plan to port some or all functionality to Scala in the future (e.g. for streaming), but for now all Scala functions should be considered prototypes.

Quick start

Here's a quick guide to getting up and running. It assumes Scala 2.9.3, Spark 0.8.1, and Python 2.7.6 (with NumPy, SciPy, and Python Imaging Library) are already installed. First, download the latest build and add it to your Python path:

export PYTHONPATH=your_path_to_thunder/python/:$PYTHONPATH

Now go into the top-level Thunder directory and run an analysis on test data.

$SPARK_HOME/pyspark python/thunder/factorization/pca.py local data/iris.txt ~/results 4

This will run principal components analysis on the “iris” data set with 4 components and write the results to a folder in your home directory. The same analysis can be run interactively in a shell. Start PySpark:

$SPARK_HOME/pyspark

Then run the same analysis:

>>> from thunder.util.parse import parse
>>> from thunder.factorization.pca import pca
>>> lines = sc.textFile("data/iris.txt")
>>> data = parse(lines).cache()
>>> scores, latent, comps = pca(data, 4)
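
To make the three return values more concrete, here is a minimal local sketch of PCA in plain NumPy. It is an illustration of the technique only, not Thunder's implementation: the function name `pca_local` is hypothetical, and the reading of `scores`, `latent`, and `comps` as projections, variances, and component directions (with rows treated as observations) is an assumption.

```python
import numpy as np

def pca_local(data, k):
    """Hypothetical local PCA via SVD; not Thunder's distributed version.

    Assumes rows are observations and columns are features.
    """
    data = np.asarray(data, dtype=float)
    # Center each column so the SVD captures variance, not the mean.
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    comps = vt[:k]                                # top-k component directions
    latent = (s[:k] ** 2) / (data.shape[0] - 1)   # variance along each component
    scores = centered.dot(comps.T)                # projection of each row
    return scores, latent, comps

# Random stand-in data with the same shape as the iris measurements.
rng = np.random.RandomState(0)
x = rng.randn(150, 4)
scores, latent, comps = pca_local(x, 2)
```

In this sketch the components come out orthonormal and sorted by decreasing variance, which is the usual convention but may differ from what Thunder's `pca` returns.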

Analyses

Thunder currently includes four packages: clustering, factorization, regression, and signal processing, as well as a utils package for shared methods like loading and saving (see Input and output). Individual packages include both high-level analyses and the underlying methods and algorithms. There are several stand-alone scripts for common analyses, but the same functions (or sub-functions) can be used from within the PySpark shell for easy interactive analysis. Here is a list of the primary analyses:

clustering

kmeans - k-means clustering

factorization

pca - principal components analysis
ica - independent components analysis

regression

regress - regression (linear and bilinear)
tuning - parametric tuning curves (circular and gaussian)

signal processing

crosscorr - signal cross-correlation
fourier - Fourier analysis
localcorr - local spatial time series correlations
stats - summary statistics (mean, std, etc.)
query - average over indices

Input and output

All functions use the same format for primary input data: a text file, where the rows are neural signals (e.g. voxels, neurons) and the columns are time points. The first entries in each row are optional key identifiers (e.g. the x, y, z coordinates of each voxel), and subsequent entries are the response values for that signal at each time point (e.g. calcium fluorescence, spike counts). For example, an imaging data set with 2x2x2 voxels and 8 time points might look like:

1 1 1 11 41 2 17 43 24 56 87
1 2 1 ...
2 1 1 ...
2 2 1 ...
1 1 2 ...
1 2 2 ...
2 1 2 ...
2 2 2 ...
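
As a rough sketch of how one row of this format splits into a key and a time series, consider the helper below. The name `parse_line` and the `n_keys` parameter are hypothetical and are not Thunder's `parse` function; the sketch assumes whitespace-delimited fields with three integer key entries, as in the example above.

```python
def parse_line(line, n_keys=3):
    """Split one whitespace-delimited row into (key, values).

    n_keys is an assumed parameter: the number of leading key entries
    (3 for the x, y, z coordinates in the example above).
    """
    fields = line.split()
    key = tuple(int(f) for f in fields[:n_keys])      # voxel coordinates
    values = [float(f) for f in fields[n_keys:]]      # responses over time
    return key, values

# First row of the example data set: voxel (1, 1, 1) with 8 time points.
key, values = parse_line("1 1 1 11 41 2 17 43 24 56 87")
```

Setting `n_keys=0` would treat every entry as a response value, which matches the case where the optional keys are omitted.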

Subsets of neural signals (e.g. from different imaging planes) can be stored in separate text files within the same directory, or all in one file. Covariates (e.g. related to the stimulus or task, for regression analyses) can be loaded from MAT files or provided directly as NumPy arrays; see the relevant functions for details.

When parsing data, preprocessing can be applied to each neural signal (e.g. conversion to dF/F for imaging data).
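
For readers unfamiliar with dF/F, the conversion can be sketched as below. This is a minimal illustration, not Thunder's preprocessing code: the function name `dff` is hypothetical, and using the mean of the whole trace as the baseline is an assumption (other baselines, such as a pre-stimulus window or a percentile, are also common).

```python
import numpy as np

def dff(signal, baseline=None):
    """Convert a raw fluorescence trace to dF/F = (F - F0) / F0.

    Assumption: F0 defaults to the mean of the whole trace.
    """
    signal = np.asarray(signal, dtype=float)
    f0 = signal.mean() if baseline is None else float(baseline)
    return (signal - f0) / f0

# Response values from the first row of the example data set.
trace = [11, 41, 2, 17, 43, 24, 56, 87]
rel = dff(trace)
```

With a mean baseline the resulting trace averages to zero, so values express the fractional change relative to the signal's typical level.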

Results can be saved as MAT files, text files, or images (including automatic rescaling).
