GitHub - SimonCB765/Leaf: Code for the Leaf independent set finding algorithm

SimonCB765 / Leaf Public

Notifications You must be signed in to change notification settings
Fork 1
Star 2

Code for the Leaf independent set finding algorithm

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
BlastExecutables		BlastExecutables
Leafcull.py		Leafcull.py
PDBcontroller.py		PDBcontroller.py
README		README
__init__.py		__init__.py
checkfastaformat.py		checkfastaformat.py
performBLAST.py		performBLAST.py
processPSIoutput.py		processPSIoutput.py
userseqcontroller.py		userseqcontroller.py

Repository files navigation

General Program Notes
The Leaf source code can be downloaded from either the Leaf website (http://leaf-protein-culling.appspot.com/code_and_PDB) or the GitHub repository where it normally resides (https://github.com/SimonCB765/Leaf).
Data for culling PDB chains locally can also be downloaded from the Leaf website (http://leaf-protein-culling.appspot.com/code_and_PDB).
A paper describing the Leaf algorithm, along with comparisons to other redundancy removal algorithms, can be found at http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0055484.

Requirements For Using Leaf
Python is needed to run Leaf. The code was designed for Python 3.3, and should need no modification to work with any prior 3.x version. Minor syntactic changes may be needed for it to work with 2.x versions.
Python can be downloaded from http://www.python.org/getit/.
BLAST+ executables are required in order to cull a dataset in fasta file format. While only the makeblastdb and psiblast executables are needed, the entire BLAST+ suite can be placed in the directory.
The files can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/, making sure to download the version suitable for your system.
The downloaded executables should be placed in the BlastExecutables directory within the directory where this README is found.
BLAST+ version 2.2.27+ was used to develop Leaf, but providing there are no major modifications to the output formats no changes should be needed to get the code to work with newer versions.
The code for the Leaf program was developed using Windows 7, but should work without modification on Linux distributions.
If you are using Windows XP, then it is likely that you will have to change all slashes '/' in file paths to double backslashes '\\'.

General Leaf Usage Notes
For general usage notes run the program using the -h or --help flags. For example, 'python PDBcontroller.py -h' or 'python userseqcontroller.py -h', without the enclosing ''.
Basic usage for non-PDB culling: python userseqcontroller.py /path/to/your/fasta/file
Basic usage for culling all downloaded PDB chains: python PDBcontroller.py /path/to/PDB/data/directory
Basic usage for culling a subset of the downloaded PDB chains: python PDBcontroller.py /path/to/PDB/data/directory -i /path/to/your/file/of/chains

If no output location is specified, the directory containing the results of the culling will be placed within the directory that the culling program was called from.
The results directory will overwrite any existing directory with the same name.

Using Leaf to Cull the PDB
The expected format for the file containing the PDB chains to cull is to have one chain on a line, with no extra whitespace on the lines.
In order to consider every downloaded PDB chain in the culling, make sure to include non-X-ray chains, alpha carbon only chains, set the maximum resolution to 100 and the R value to 1.

If you've downloaded a set of chains from the server, and wish to cull all of them, then there is no reason to supply an input file (using the -i flag).
Assuming you data files are saved in the directory /path/to/PDB/data/directory, then simply call python PDBcontroller.py /path/to/PDB/data/directory.

The data files in the directory passed to the PDBcontroller.py script are treated as the entire database of chains available to be culled.
Therefore, if you supply an input subset of chains to cull (using the -i flag) that contains chains not in the data files, then those chains are simply ignored in the culling.
For example, if you call python PDBcontroller.py /path/to/PDB/data/directory -i /path/to/your/file/of/chains and the file /path/to/your/file/of/chains contains a chain XXXX that
is not in the file of chains in the /path/to/PDB/data/directory directory, then XXXX is not considered in the culling.

Using Leaf to Cull User Supplied Sequences
When culling your own sequence in FASTA format the expected format is as follows:
>Protein A Identifier
SEQUENCEFORPROTEINAONONELINE
>Protein B Identifier
SEQUENCEFORPROTEINBONONELINE
The protein identifier must all be on one line, and the line must start with a '>'. Any number of characters can follow the '>', and there are no restrictions on the characters that are permitted.
The protein identifier is taken to be every character after the '>' up to, but not including, the end of line character.
The sequence for the protein is expected to be in all capital letters, and occur on the first line after the identifier line. However, the sequence can occur over multiple lines and contain lower case letters. The input FASTA file will simply be converted to the expected FASTA file format.

It is possible that your antivirus may interfere with the calling of the psiblast executable.
If this happens, then change the last line of performBLAST.py from
subprocess.call(argsPSI, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
to
subprocess.call(argsPSI)
This should correct the problem, but will cause all messages from PSI-BLAST to display all warnings.
Similarly, if for any reason the calling of the psiblast executable appears to hang, this change may fix the problem, or cause the real issue to be made clear.

Results Format Notes
PDB Results
CulledList.txt
A list of the redundant chains, one chain per line.
KeptList.txt
A list of the non-redundant chains, one chain per line.
User Culling Results
Removed.txt
A list of the redundant input proteins.
KeptList.txt
A table of information about the non-redundant input proteins. The two columns are separated by tabs ('\t').
The first column contains protein identifiers, and the second contains the length of the proteins.
KeptFasta.fasta
A FASTA fromat file of the non-redundant input proteins.

Data Collection Notes
If 50% or more of the amino acids in an amino acid sequence are X (i.e. unspecified or unknown), then the entity is marked as a NonProtein, and not included in the list of proteins.
When determining the value for the resolution:
Only one of the _refine and _reflns records is used (with preference given to the _refine record).
If there is no value present for the resolution, then it is set to 100.
Similarity if there is no value present for the R-value or the R-free value, then these values are set to 1.
When determining the organism which the entity originates from:
Only one of _entity_src_gen, _entity_src_nat or _pdbx_entity_src_syn is used (with preference given in the order _entity_src_nat, _entity_src_gen and finally _pdbx_entity_src_syn).

_entry
_entry.id
Used to determine the name of the PDB entry (e.g. 3A0B).
_entity
_entity.id
The numeric identifier used to identify the specific entity within the entry record.
_entity.pdbx_description
A description of the entity.
_entity_poly
_entity_poly.entity_id
Used to determine the entity that the rest of the _entity_poly information pertains to.
_entity_poly.type
Used to determine the type of the entity (e.g. polypeptide(L)).
_entity_poly.pdbx_seq_one_letter_code_can
Used to determine the nucleotide or amino acid sequence of the entity.
_entity_poly.pdbx_strand_id
Used to determine the single character codes corresponding to the entity (the 'a', 'A', 'b', '1', etc. that comes after the entry e.g. 3A0Ba, 3A0BA, ...).
_exptl
_exptl.method
The experimental method that was used to determine the structure.
_atom_site
_atom_site.label_atom_id
Records the type of atom. Used to determine if the entity contains only alpha carbon atoms.
_atom_site.label_entity_id
Used to determine the entity that the rest of the _atom_site information pertains to.
_refine
_refine.ls_d_res_high
Used to determine the resolution of the structure in an X-ray diffraction experiment.
_refine.ls_R_factor_obs
Used to determine the R-value. This is the R-value measurement used when giving an upper bound on the R-value during culling.
_refine.ls_R_factor_R_free
Used to determine the R-free value.
_reflns
_reflns.d_resolution_high
Used to determine the resolution of the structure in an X-ray diffraction experiment.
_struct_ref
_struct_ref.entity_id
Used to determine the entity that the rest of the _struct_ref information pertains to.
_struct_ref.db_name
Used to determine the name of the external database linked to the entity.
_struct_ref.db_code
Used to determine the code which can be used to access information about the entity in the external database
_entity_src_gen
_entity_src_gen.entity_id
Used to determine the entity that the rest of the _entity_src_gen information pertains to.
_entity_src_gen.pdbx_gene_src_scientific_name
Used to determine the scientific name of the organism that the entity came from.
_entity_src_nat
_entity_src_nat.entity_id
Used to determine the entity that the rest of the _entity_src_nat information pertains to.
_entity_src_nat.pdbx_organism_scientific
Used to determine the scientific name of the organism that the entity came from.
_pdbx_entity_src_syn
_pdbx_entity_src_syn.entity_id
Used to determine the entity that the rest of the _pdbx_entity_src_syn information pertains to.
_pdbx_entity_src_syn.organism_scientific
Used to determine the scientific name of the organism that the entity came from.