Skip to content

tuqiang2014/scRNA

 
 

Repository files navigation

scRNA

Python framework for single-cell RNA-seq clustering with special focus on transfer learning. This package contains methods for generating artificial data, clustering, and transfering knowledge from a source to a target datasets.

This software was written by Nico Goernitz, Bettina Mieth, Marina Vidovic, Alex Gutteridge.

Travis-CI

News

  • (3.4.17) Added Travis-CI
  • (3.4.17) Added string label support
  • Simple example available
  • Website is up and running
  • Wiki with detailed information (e.g. command line arguments)
  • Please report Bugs or other inconveniences
  • scRNA can now be conveniently installed using the pip install git+https://github.com/nicococo/scRNA.git command (see Installation for further information)
  • Command line script available

Getting started

Installation

We assume that Python >2.7 is installed and the pip command is callable from the command line. If starting from scratch, we recommend installing the Anaconda open data science platform (w/ Python 2.7) which comes with a bunch of most useful packages for scientific computing.

The scRNA software package can be installed using the pip install git+https://github.com/nicococo/scRNA.git command. After successful completion, three command line arguments will be available for MacOS and Linux only:

  • scRNA-generate-data.sh
  • scRNA-source.sh
  • scRNA-target.sh

Example

Step 1: Installation with pip install git+https://github.com/nicococo/scRNA.git Installation with pip install git+https://github.com/nicococo/scRNA.git

Step 2: Check the scripts Check for the scripts

Step 3: Create directory /foo. Go to directory /foo. Generate some artificial data by simply calling the scRNA-generate-data.sh (using only default parameters).

Generate artificial data

This will result in a number of files:

  • Gene ids
  • Source- and target data
  • Source- and target ground truth labels

Step 4: NMF of source data using the provided gene ids and source data. Ie. we want to turn off the cell- and gene-filter as well as the log transformation. You can provide source labels to be used as a starting point for NMF. If not those labels will be generated via NMF Clustering. Potential problems:

  • If a ''Intel MKL FATAL ERROR: Cannot load libmkl_avx.so or libmkl_def.so.'' occurs and Anaconda open data science platform is used, then use conda install mkl first.
  • Depending on the data and cluster range, this step can take time. However, you can speed up the process by tuning off the t-SNE plots using the --no-tsne command (see Wiki for further information)

Cluster the source data

This will result in a number of files:

  • t-SNE plots (.png) for every number of cluster as specified in the --cluster-range argument (default 6,7,8)
  • Output source model in .npz format for every number of cluster as specified in the --cluster-range argument (default 6,7,8)
  • A summarizing .png figure
  • True cluster labels - either as provided from user or as generated via NMF Clustering - (and corresponding cell id) in .tsv format for every number of cluster as specified in the --cluster-range argument (default 6,7,8)
  • Model cluster labels after NMF (and corresponding cell id) in .tsv format for every number of cluster as specified in the --cluster-range argument (default 6,7,8)

Step 5: Now, it is time to cluster the target data and transfer knowledge from the source model to our target data. Therefore, we need to choose a source data model which was generated in Step 4. In this example, we will pick the model with 8 cluster (src_c8.npz).

  • Depending on the data, the cluster range and the mixture range, this step can take a long time. However, you can speed up the process by tuning off the t-SNE plots using the --no-tsne command (see Wiki for further information)

Cluster the target data

Which results in a number of files (for each value in the cluster range).

  • Predicted cluster labels after transfer learning (and corresponding cell id) in .tsv format for every number of cluster as specified in the --cluster-range argument (default 6,7,8)
  • t-SNE plots with predicted labels (.png)
  • Data and gene ids in .tsv files

In addition there is a summarizing .png figure of all accs and a t-SNE plot with the real target labels, if they were provided.

Cluster the target data

Command line output shows a number of results: unsupervised and supervised (if no ground truth labels are given this will remain 0.) accuracy measures.

About

Python framework for single-cell RNA-seq clustering with special focus on transfer learning.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 95.2%
  • Jupyter Notebook 3.1%
  • R 1.7%