pyProCT

pyProCT is an open source cluster analysis software especially adapted for jobs related to structural proteomics. Its approach allows users to define a clustering goal (clustering hypothesis) based on their domain knowledge. This hypothesis will guide the software in finding the best algorithm and parameters (including the number of clusters) to obtain the result that best fulfills their expectations. In this way users do not need to treat cluster analysis algorithms as a black box, which will (hopefully) improve their results. pyProCT not only generates a resulting clustering, it also implements some use cases like the extraction of representatives or trajectory redundancy elimination.

Installation

pyProCT is quite easy to install using pip. Just write:

> sudo pip install pyProCT

And pip will take care of all the dependencies (shown below).

It is recommended to install NumPy and SciPy using your OS software manager before starting the installation. You can try to download and install them manually if you dare.

mpi4py is pyProCT's last dependency. It can cause problems when installed on operating systems such as SUSE. If the installation of this package is not successful, pyProCT can still work in Serial and Parallel (multiprocessing) modes.

Using pyProCT as a standalone program

The preferred way to use pyProCT is through a JSON "script" that describes the clustering task. It can be executed using the following line in your shell:

> python -m pyproct.main script.json

The JSON script has 4 main parts, each one dealing with a different aspect of the clustering pipeline. These sections are:

  • global: Handles workspace and scheduler parameterization.
  • data: Handles distance matrix parameterization.
  • clustering: Handles algorithms and evaluation parameterization.
  • postprocessing: Handles what to do with the clustering we have calculated.
{
	"global":{},
	"data":{},
	"clustering":{},
	"postprocessing":{}
}

Global

{
	"control": {
		"scheduler_type": "Process/Parallel",
		"number_of_processes": 4
	},
	"workspace": {
		 "tmp": "tmp",
		 "matrix": "matrix",
		 "clusterings": "clusterings",
		 "results": "results",
		 "base": "/home/john/ClusteringProject"
	}
}

This is an example of a global section. It describes the work environment (workspace) and the type of scheduler that will be built. Defining the subfolders of the workspace is not mandatory; however, it may be convenient in some scenarios (for instance, in serial multiple-clustering projects, sharing the tmp folder lowers disk usage, as it is overwritten at each step).

This is a valid global section using a serial scheduler and default names for workspace inner folders:

{
	"control": {
		"scheduler_type": "Serial"
	},
	"workspace": {
		 "base": "/home/john/ClusteringProject"
	}
}

pyProCT allows the use of 3 different schedulers that help to improve the overall performance of the software by parallelizing some parts of the code. The available schedulers are "Serial", "Process/Parallel" (uses Python's multiprocessing) and "MPI/Parallel" (uses MPI through the module mpi4py).

Data

The data section defines how pyProCT must build the distance matrix that will be used by the clustering algorithms. Currently pyProCT offers three options to build that matrix: "load", "rmsd" and "distance".

  • rmsd: Calculates an all-vs-all RMSD matrix using any of the available pyRMSD calculators. It can calculate the RMSD of the fitted region (defined by a ProDy-compatible selection string in fit_selection), or use one selection to superimpose and another to calculate the RMSD (calc_selection).
  • distance: After superimposing the selected region, it calculates the all-vs-all distances between the geometrical centers of the region of interest (body_selection).
  • load: Loads a precalculated matrix.
{
	"type": "pdb_ensemble",
	"files": [
		"A.pdb",
		"B.pdb"
	],
	"matrix": {
		"method": "rmsd",
		"parameters": {
			"calculator_type": "QCP_OMP_CALCULATOR",
			"fit_selection": "backbone",
		},
		"image": {
			"filename": "matrix_plot"
		},
		"filename":"matrix"
	}
}

The matrix can be stored if the filename property is defined. The matrix can also be stored as an image if the image property is defined.
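
For comparison, a hypothetical matrix subsection using the "distance" method could look like the sketch below. The selection strings are only illustrative: fit_selection defines the superposition region and body_selection the region whose geometrical center is tracked.

{
	"matrix": {
		"method": "distance",
		"parameters": {
			"fit_selection": "backbone",
			"body_selection": "resname LIG"
		},
		"filename": "matrix"
	}
}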

pyProCT can currently load pdb and dcd files. pdb files can be loaded in two ways:

  1. Using a list of file paths. If the file extension is ".txt" or ".list" it will be treated as a pdb list file. Each line of such files will be a pdb path, or a pdb path and a selection string separated by a comma.
A.pdb, name CA
B.pdb
C.pdb, name CA
...
  2. Using a list of file objects:
{
	"file": ... ,
	"base_selection": ...
}

Where base_selection is a ProDy-compatible selection string. Loading files this way can help in cases where not all files have structures with the same number of atoms: base_selection should define the common region between them (if a 1-to-1 mapping does not exist, the RMSD calculation will be wrong).
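
For example, if the alpha carbons are the common region between two pdb files, the files list could be written as follows (a sketch; adapt the selection to your system):

	"files": [
		{
			"file": "A.pdb",
			"base_selection": "name CA"
		},
		{
			"file": "B.pdb",
			"base_selection": "name CA"
		}
	]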

  3. Only for dcd files:
{
	"file": ...,
	"atoms_file": ...,
	"base_selection": ...
}

Where atoms_file is a pdb file with at least one frame that holds the atomic information needed by the dcd file.
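
A filled-in example for a dcd trajectory might look like this (the file names are illustrative):

{
	"file": "trajectory.dcd",
	"atoms_file": "topology.pdb",
	"base_selection": "name CA"
}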

Clustering

The clustering section is divided into 3 subsections:

{
	"generation": {
		"method": "generate"
	},
	"algorithms": {
		...
	},
	"evaluation": {
		...
	}
}

generation

Defines how the clustering will be generated (load or generate). If load is chosen, the section must contain the clustering that will be used. Ex.:

{
	"clustering": {
		"generation": {
			"method" : "load",
			"clusters": [
					{
						"prototype " : 16,
						"id": "cluster_00",
						"elements" : "9, 14:20"
					},
					{
						"prototype": 7,
						"id": "cluster_01",
						"elements": "0:8, 10:14, 21"
					}
			]
		}
	}
}

algorithms

If pyProCT has to generate the clustering, this section defines the algorithms that will be used as well as their parameters (if necessary). The currently available algorithms are: kmedoids, hierarchical, dbscan, gromos, spectral and random. Each algorithm can define its own list of parameters; however, the preferred way to work with pyProCT is to let it generate them automatically. Almost all algorithms accept the property max, which defines the maximum number of parameter collections that will be generated for that algorithm.

Ex.

{
	"kmedoids": {
		"seeding_type": "RANDOM",
		"max": 50,
		"tries": 5
	},
	"hierarchical": {

	},
	"dbscan": {
		"max": 50
	},
	"gromos": {
		"max": 50
	},
	"spectral": {
		"max": 50,
		"force_sparse":true,
	}
}

Algorithm parameters can be explicitly written:

{
	"kmedoids": {
		"seeding_type": "RANDOM",
		"max": 50,
		"tries": 5,
		"parameters":[{"k":4},{"k":5},{"k":6}]
	}
}

evaluation

This section holds the Clustering Hypothesis, the core of pyProCT. Here the user can define what the expected clustering should look like. First, the user must set the expected range for the number of clusters. An estimation of the dataset noise and the minimum cluster size (the minimum number of elements a cluster must have in order not to be considered noise) complete the quantitative definition of the target result.

Ex.

{
	"maximum_noise": 15,
	"minimum_cluster_size": 50,
	"maximum_clusters": 200,
	"minimum_clusters": 6,
	"query_types": [ ... ],
	"evaluation_criteria": {
		...
	}
}

The second part of the Clustering Hypothesis tries to characterize the internal traits of the clustering in a more qualitative way. Concepts like cluster "Compactness" or "Separation" can be used here to define the expected clustering. To this end, users must write their expectations in the form of criteria. These criteria are, in general, linear combinations of Internal Clustering Validation Indices (ICVs). The best clustering will be the one that gets the best score in any of these criteria. See this document to get more insight into the different implemented criteria and their meaning.

Additionally, users may ask pyProCT for the results of these ICVs and other evaluation functions (e.g. the average cluster size) by adding them to the query_types array.
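
Following the placeholder style used in this document, the two qualitative entries of the evaluation section have roughly the shape sketched below. The "action"/"weight" keys are an assumption based on the description of criteria as weighted combinations of ICVs; check the criteria document referenced above for the exact names of the available ICVs and queries and the exact weighting syntax.

{
	"query_types": [ICV_OR_QUERY_NAME, ...],
	"evaluation_criteria": {
		CRITERION_ID: {
			ICV_NAME: {
				"action": [">"/"<"],
				"weight": FLOAT
			}
		}
	}
}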

Postprocessing

Getting a good quality clustering is not enough; we would like to use it to extract information. pyProCT implements some use cases that may help users to extract this information.

{
	"rmsf":{},

	"centers_and_trace":{},

	"representatives":{
		"keep_remarks": [true/false],
		"keep_frame_number": [true/false]
	},

	"pdb_clusters":{
		"keep_remarks": [true/false],
		"keep_frame_number": [true/false]
	},

	"compression":{
		"final_number_of_frames": INT,
		"file": STRING,
		"type":[‘RANDOM’,’KMEDOIDS’]
	},

	"conformational_space_comparison":{}
}
  • rmsf: Calculates the global and per-cluster (and per-residue) root mean square fluctuation (to be visualized using the GUI).
  • centers_and_trace: Calculates all geometrical centers of the calculation selection of the system (to be visualized using the GUI).
  • representatives: Extracts all the representatives of the clusters into the same pdb.
  • pdb_clusters: Extracts all clusters into separate pdbs.
  • compression: Reduces the redundancy of the trajectory using the resulting clustering (see the sketch after this list).
  • conformational_space_comparison: Work in progress.
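
As an illustration, a compression entry that keeps 100 frames chosen with k-medoids could look like this (the output file name is illustrative):

{
	"compression": {
		"final_number_of_frames": 100,
		"file": "compressed.pdb",
		"type": "KMEDOIDS"
	}
}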

Script validation

As the "script" is indeed a JSON object, any JSON validator can be used to discover the errors in case of script loading problems. A good example of such validators is JSONLint.

Using pyProCT as part of other programs

  • Using algorithms
  • Clustering from label lists
  • Using ICVs with custom clusterings
  • Performing the whole protocol
Driver(Observer()).run(parameters)

The necessary documentation to use pyProCT classes is written inside the code. It has been extracted here and here. We are currently trying to improve this documentation with better explanations and examples.

See this file.

Using it as a separate program from other Python script

  • Loading results
  • Generating scripts programmatically

See this project for some examples.

Parallel execution

To execute pyProCT in parallel you just need to issue this line:

> mpirun -np NumberOfProcesses python -m pyproct.main --mpi script.json

When running pyProCT using MPI you will need to use the MPI/Parallel Scheduler or it will just execute several independent serial runs.

Remember that mpi4py must be built with the same MPI libraries and version used by mpirun; otherwise you won't be able to execute it.
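
For example, a global section for an MPI run could simply select the MPI scheduler (a sketch; with MPI the number of processes comes from mpirun's -np flag rather than from the script):

{
	"control": {
		"scheduler_type": "MPI/Parallel"
	},
	"workspace": {
		"base": "/home/john/ClusteringProject"
	}
}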

Documentation

We are still experimenting to see which documentation generator fits us best. Currently we have two versions of the documentation: one using Sphinx and the other using Doxygen+doxpy. See them here and here. We will possibly publish it on a cloud solution like readthedocs.org.

Learn more

A more detailed explanation of the script contents can be found here, and a discussion about the different implemented ICVs can be found here.

Please do not hesitate to send a mail to victor.gil.sepulveda@gmail.com with your questions, criticisms and whatever you think is not working or could be done better. It will help to improve the software!

TODO

  • To improve this documentation (better explanations, more examples and downloadable scripts).

  • Refactoring and general improvements:

    • Total refactoring (Clustering and Clusters are immutable, hold a reference to the matrix -> prototypes are always updated)
    • Rename script stuff
    • Rename functions and vars
    • Minimizing dependencies with scipy
    • Minimizing dependencies with prody (create my own reader)
    • Adding its own Hierarchical clustering code (educational motivations)
    • Improve spectral algorithm (add more tests - comparisons with other implementations, adding new types)
    • Improve MPI load balance (i.e. parameter generation must be processed in parallel)
    • Improve test coverage
    • The script must accept numbers and percentages
    • Use JSON schema to validate the script. Try to delegate the full responsibility of validating to pyProCT (instead of the GUI)
    • When loading a dcd file, we only want to load the atomic data of the associated pdb.
    • Change "compression" by "redundancy_elimination"
    • Allow loading all files (or a glob) from a folder.
  • Symmetry handling:

    • Symmetry handling for fitting coordinates.
    • Improve symmetry handling for calculation coordinates (e.g. ligands). [x] - Simple chain mapping feature.
  • New algorithms:

    • Modularity-based (Newman J. 2003)
    • Passing messages (Frey and Dueck 2007)
    • Flow simulation (Stijn van Dongen)
    • Fuzzy Clustering
    • Jarvis-Patrick Algorithm
    • Others (adaptive spectral clustering flavours)
  • New quality functions.

    • Balancedness: The sizes of the clusters must be balanced.
    • J quality function: Cai Xiaoyan Proceedings of the 27th Chinese Control Conference
    • Metastability function (Q) in Chodera et al., J. Chem. Phys. 126, 155101 (2007).
    • Improve separation quality functions.
    • New standard separation ICVs (require immutable prototypes)
      Separation: the clusters themselves should be widely spaced. There are three common approaches to measuring the distance between two different clusters:
      - Single linkage: measures the distance between the closest members of the clusters.
      - Complete linkage: measures the distance between the most distant members.
      - Comparison of centroids: measures the distance between the centers of the clusters.
      
  • New features:

    • Refine noise in DBSCAN
    • Refine a preselected cluster (e.g. "noise" or "heterogeneous").
  • New postprocessing options:

    • Refinement
    • Kinetic analysis
