Tutorial

Available at:

https://github.com/martaferri/SBI_Project

Prerequisites

  • Python 3.0 Release

https://www.python.org/download/releases/3.0/

  • Modeller 9.19 Release

https://anaconda.org/salilab/modeller

It is also recommended to have a program for interactive visualization and analysis of molecular structures and related data, such as Chimera or VMD.

https://www.cgl.ucsf.edu/chimera/

http://www.ks.uiuc.edu/Research/vmd/

Input files

The input files must be PDB files (.pdb), each containing a pair of interacting chains, located in a directory chosen by the user. This directory can also be provided as a compressed archive (.tar.gz).

If you are interested in rebuilding a macrocomplex, you can instead provide its PDB file (.pdb) downloaded from the PDB.

Python modules

A module is a file containing Python definitions and statements. It is important to consider that definitions from a module can be imported into other modules or into the main module.

  • sbi_project.py: main module or program created to reconstruct a macrocomplex given a set of interacting pairs (prot-prot, prot-RNA). It also considers the possibility to rebuild a macrocomplex if the input is a macrocomplex (.pdb).

    Additionally, this module contains the ArgumentParser object, created with the argparse module, which handles command-line options, arguments and sub-commands. The ArgumentParser object holds all the information necessary to parse the command line into Python data types.

  • get_interactions.py: this module is only used by the main module if the given input is a macrocomplex (.pdb). It gets all possible interactions in this structure by computing the distances between chains and pairing those that meet the following conditions: less than 8 Å between their alpha carbons (CA) and at least 8 CA atoms involved in the interaction. It creates the pair files (.pdb), which are later filtered in the main program by the imported module reduce_inputs_func.py to keep only the non-redundant interactions used to build the model.

  • reduce_inputs_func.py: this module is imported into the main module to speed up model building, as it keeps only the non-redundant interactions from the whole set of input pairs. Given a set of input pairs (.pdb), it compares each chain pair with the rest. This comparison has two steps: sequence and structure. First, a pairwise alignment is performed to detect similar sequences (cut-off = 0.9). The alignment score is normalized by the length of the longest sequence. If the normalized score is higher than the established cut-off, the analysis proceeds to the second step, in which the similar chains are superimposed. These similar chains belong to two different interaction pairs and are referred to as the fixed and moving chains. The rotation-translation (rotran) matrix is applied to the partner of the moving chain, which is referred to as the alternative/new chain. Finally, the distances between the CA atoms of the comparing chain (the partner of the fixed chain) and the new chain are computed. If the distance is lower than 9 Å, the interactions are considered the same (redundant). The module saves the non-redundant chain pairs (.pdb) obtained from this reduction process, which are the input pairs required to build the model.

  • functions.py: this module is composed of a set of functions that solve biological and technical problems during the analysis. It is imported into the other modules so they can use the defined functions.

  • utilities.py: this module is composed of a set of variables used to solve biological and technical problems during the analysis. It is imported into the other modules so they can use the defined variables.

  • classes.py: this module defines a class that checks whether the input variable is a directory or a compressed directory (.tar.gz).

  • DOPE_profile.py: this module is only used if the argument '-e', '--energy_plot' is set. Using Modeller, it creates a DOPE profile plot (.jpg) from a macrocomplex (.pdb) that has no nucleic acid chains.

  • DOPE_comparison.py: this module is only used if the argument '-ref', '--refine' is set. It refines a model previously generated with the main program according to the optimization parameters defined by Modeller. It also generates a comparison energy plot between the model generated by the program and the refined one.
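
The distance criterion described for get_interactions.py can be illustrated with a minimal sketch. It works on plain lists of CA coordinates rather than on the project's own objects, and it assumes that "at least 8 CA" means the total number of CA atoms, from both chains, that take part in at least one contact below 8 Å. The names chains_interact and interacting_pairs are hypothetical and are not taken from the repository.

```python
import math
from itertools import combinations

CONTACT_DISTANCE = 8.0   # Å, threshold quoted in the module description
MIN_CA_IN_CONTACT = 8    # minimum number of CA atoms involved (assumed to count both chains)


def chains_interact(ca_a, ca_b, max_dist=CONTACT_DISTANCE, min_atoms=MIN_CA_IN_CONTACT):
    """Return True if two chains satisfy the distance criterion.

    ca_a and ca_b are lists of (x, y, z) CA coordinates.
    """
    in_contact_a, in_contact_b = set(), set()
    for i, p in enumerate(ca_a):
        for j, q in enumerate(ca_b):
            if math.dist(p, q) < max_dist:
                in_contact_a.add(i)
                in_contact_b.add(j)
    return len(in_contact_a) + len(in_contact_b) >= min_atoms


def interacting_pairs(chains):
    """Return the chain-id pairs that interact; chains maps id -> CA coordinates."""
    return [(a, b) for a, b in combinations(sorted(chains), 2)
            if chains_interact(chains[a], chains[b])]
```

An all-against-all loop like this is quadratic in the number of atoms; for large complexes a spatial index (for example a k-d tree) would be the usual replacement.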

Output files

The output files depend on the command-line options and arguments the user sets when running the analysis. All of them are stored in the output directory selected by the user (-o or --output).

The arguments accepted by the ArgumentParser object can be listed with:

$ python3 sbi_project.py -h (or --help)

The following arguments are defined:

  • -i INDIR, --input INDIR

    It must be a directory provided by the user which contains the input pairs (a compressed .tar.gz archive is also accepted). If you want to rebuild a macrocomplex from the PDB, you can provide its PDB file (.pdb) instead.

    (default: None)

  • -o OUTDIR, --output OUTDIR

    This is a directory that the program creates during execution to store the results, organized into subdirectories.

    (default: None)

  • -v, --verbose

    This option allows the user to follow the progress of the program step by step.

    (default: False)

  • -cc {0.25,0.5,1,1.2,1.5}, --clash_distance_cutoff {0.25,0.5,1,1.2,1.5}

    Choose a cut-off distance to detect clashes between chains.

    (default: 1.2)

  • -e, --energy_plot
    DOPE profile of the macrocomplex (not including nucleic acid chains) generated by Modeller.

    (default: False)

  • -ref, --refine
    Refines a model previously generated with the program according to the optimization parameters defined by Modeller.

    (default: False)

  • -ver {short,intensive}, --version {short,intensive}

    The short version only considers the first model the program can build, while the intensive version considers all the models the program can build, depending on the input pair from which it starts constructing the model.

    (default: short)
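
As an illustration of how the options listed above map onto argparse, here is a minimal sketch. The flags, choices and defaults are the ones documented above; the help strings, the dest names indir and outdir, and the function name build_parser are assumptions made for this example and may differ from sbi_project.py.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="sbi_project.py",
        description="Reconstruct a macrocomplex from interacting chain pairs.",
        # Appends "(default: ...)" to every help string, as in the listing above.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-i", "--input", dest="indir", metavar="INDIR", default=None,
                        help="directory (or .tar.gz) with input pairs, or a .pdb file")
    parser.add_argument("-o", "--output", dest="outdir", metavar="OUTDIR", default=None,
                        help="output directory created by the program")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print progress messages")
    parser.add_argument("-cc", "--clash_distance_cutoff", type=float,
                        choices=[0.25, 0.5, 1, 1.2, 1.5], default=1.2,
                        help="cut-off distance to detect clashes between chains")
    parser.add_argument("-e", "--energy_plot", action="store_true",
                        help="generate the DOPE profile of the macrocomplex")
    parser.add_argument("-ref", "--refine", action="store_true",
                        help="refine the generated model with Modeller")
    parser.add_argument("-ver", "--version", choices=["short", "intensive"],
                        default="short", help="build the first model or all models")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Multi-character single-dash flags such as -cc, -ref and -ver are accepted by argparse because each one is registered as an exact option string.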

With the default options, the following outputs are generated:

files:

  • alignments_results.txt: file generated from the pairwise comparisons between all the inputs to create a dictionary of equivalences.

  • unique_chains_fasta.mfa: multifasta file containing the unique chains to allow the user to introduce the stoichiometry of the macrocomplex.

files, directories and subdirectories(->):

  • models (directory) -> 1 (directory) -> .pdb: the generated model.

  • reduced_inputs (directory) : non-redundant interactions pdb files obtained with the reduction of inputs process.

(*) If -ver, --version is set to intensive, all the possible models are created inside the models directory, each in a subdirectory named with the number of the model.

(*) If the input is a PDB file (a macrocomplex from the PDB), another subdirectory named 'get_interactions_results' is created with all the possible interaction pairs, which are later filtered by the input reduction process.

In addition to the outputs mentioned above, if -e, --energy_plot is set, the following outputs are generated:

directories and files:

  • models (directory) -> 1 (directory) -> dope_profile.jpg (image of the plot) and .pdb.profile (DOPE profile file).
  • temp (directory): temporary directory containing the pdb files without the nucleic acid chains, so that they can be used in the refining process once the model has been built (to avoid Modeller errors).

In addition to the outputs mentioned above, if -ref, --refine is set, the following outputs are generated:

directories and files:

  • models (directory) -> 1 (directory) -> .pdb = generated model and .pdb.B = refined.
  • optimization_results (directory) -> dope_profile, refined_models, stats and log_files (subdirectories). Inside dope_profiles, an image named '.pdb.dope_profile.jpg' comparing both models is created.
  • temp (directory): temporary directory containing the pdb files (.pdb = generated model, .pdb.B = refined) without the nucleic acid chains, so that they can be used in the refining process once the model has been built (to avoid Modeller errors).

Biological considerations

During the development of this project, we had to take some specific details into account in order to write functions that handle the relevant biological parameters and build the models properly.

Distinguishing the type of the chain

The inputs given to the program may be of different nature: protein or nucleic acid. These types of chains are composed of different residues and atoms. To be able to work with both of them, we had to make some adaptations. These are the functions involved:

  • Labeling:
    • check_type(): First of all, we wrote a function to determine the type of the chain. It checks whether the chain has alpha carbons: if it does, we consider it a protein; if it doesn't, a nucleic acid. It returns a string with the label.
  • Calculating distances:
    • get_seq_from_pdb(): Protein residues are written in three-letter format, while nucleotides are written with two letters or one, so the sequences have to be extracted from the pdb file in different ways. We use a function in the protein case (three_to_one()) and take the last letter by indexing in the nucleic acid case.
    • get_atoms_list(): The superimposition step requires atom lists. We decided to consider only the CA atoms in proteins and the P atoms in nucleic acids.
    • calc_distances_residues(): To get the interactions, we compare distances between CA atoms. As mentioned before, nucleic acids don't have this kind of atom. We wrote a function that renames the C1' atoms of the nucleic acid chains to CA (adapt_chains()). This way, we can compute the distances in the same way as if we only had one type of chain.
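
The chain-type handling described in this list can be sketched in a few lines. This is a simplified stand-in, not the project's code: chains are plain lists of dictionaries instead of Biopython objects, the label strings "protein" and "nucleic_acid" are assumed, and the function names only loosely follow the ones mentioned above.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def check_type(chain):
    """Label a chain 'protein' if any residue has a CA atom, else 'nucleic_acid'."""
    has_ca = any("CA" in res["atoms"] for res in chain)
    return "protein" if has_ca else "nucleic_acid"


def get_seq(chain):
    """Return the one-letter sequence of a chain, whatever its type."""
    if check_type(chain) == "protein":
        return "".join(THREE_TO_ONE.get(res["name"], "X") for res in chain)
    # Nucleotide names such as 'DA', 'DG' or 'U': keep the last letter.
    return "".join(res["name"].strip()[-1] for res in chain)


def get_atoms_list(chain):
    """Return the coordinates used for superimposition: CA for protein, P otherwise."""
    atom_name = "CA" if check_type(chain) == "protein" else "P"
    return [res["atoms"][atom_name] for res in chain if atom_name in res["atoms"]]


def adapt_chain(chain):
    """Return a copy of a nucleic acid chain with C1' atoms renamed to CA."""
    adapted = []
    for res in chain:
        atoms = dict(res["atoms"])  # copy so the input chain is left untouched
        if "C1'" in atoms:
            atoms["CA"] = atoms.pop("C1'")
        adapted.append({"name": res["name"], "atoms": atoms})
    return adapted
```

Note that check_type must be called before adapt_chain in this sketch, since a renamed nucleic acid chain would otherwise be labeled as a protein.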

Superimposing

When superimposing two atom lists, both lists have to contain the same number of atoms. Usually, reaching this step means that the sequences of the chains being superimposed are equal. However, there are cases in which the chains are similar but not exactly the same, so they have a different number of atoms. We handled this by creating two functions:

  • Comparing the chains:
    • refine_for_superimpose(): The alignment of the sequences of the two chains reveals their differences. We obtain the sequences with a function mentioned above and then perform the pairwise alignment. Based on the output of this tool, we obtain a pattern of 1s and 0s that marks the positions with matching or different residues.
  • Creating the new chains:
    • get_chain_refined(): Once we obtained a pattern, we knew that the number of matches would be the same for both chains. This function creates a new chain that contains only the matching residues. This allowed us to finally obtain lists with the same number of atoms.
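
The two steps above can be sketched as follows. The sketch assumes that the alignment is available as two gapped strings of equal length (with '-' as the gap character) and that residues are simple list items; match_patterns and keep_matching are hypothetical names, not the repository's functions.

```python
def match_patterns(aligned_a, aligned_b, gap="-"):
    """Return one 0/1 pattern per chain, with one flag per residue of that chain.

    A residue gets 1 when it is aligned to an identical residue, and 0 when it
    is aligned to a different residue or to a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pattern_a, pattern_b = [], []
    for res_a, res_b in zip(aligned_a, aligned_b):
        match = int(res_a == res_b and res_a != gap)
        if res_a != gap:
            pattern_a.append(match)
        if res_b != gap:
            pattern_b.append(match)
    return pattern_a, pattern_b


def keep_matching(residues, pattern):
    """Keep only the residues flagged with 1 in the pattern."""
    return [res for res, keep in zip(residues, pattern) if keep]
```

Because every match adds a 1 to both patterns, the two filtered residue lists always have the same length, which is the property the superimposition needs.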

Reducing the inputs

One initial step of the program is to analyse the chain interactions given by the user and determine whether there are redundant pairs, so that the inputs can be reduced and the models built more efficiently. This is done in the reduce_inputs.py script. For this purpose, the program makes pairwise comparisons to detect similar sequences, and if there is similarity, those chains enter a superimposition step where the structural similarity is tested.
To decide whether the sequence similarity between the tested sequences was high enough, we set a threshold of 0.90. This was based on the distribution of the obtained scores, which showed extreme values. Setting the threshold at this point ensured that the truly similar sequences would obtain a greater score.
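
The sequence test can be illustrated with a minimal sketch. The scoring scheme is an assumption: the alignment score is computed here as the best global alignment with 1 point per identical residue and no mismatch or gap penalties (equivalent to the length of the longest common subsequence), then normalized by the length of the longest sequence and compared with the 0.9 cut-off. The project may use a different scoring scheme, and the function names are hypothetical.

```python
def global_match_score(seq_a, seq_b):
    """Global alignment score with match = 1 and no mismatch or gap penalties.

    This equals the length of the longest common subsequence.
    """
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_score(seq_a, seq_b):
    """Alignment score divided by the length of the longest sequence."""
    longest = max(len(seq_a), len(seq_b))
    return global_match_score(seq_a, seq_b) / longest if longest else 0.0


def are_similar(seq_a, seq_b, cutoff=0.9):
    """True if the normalized score is strictly higher than the cut-off."""
    return normalized_score(seq_a, seq_b) > cutoff
```

Normalizing by the longest sequence means that a short chain fully contained in a much longer one scores low, so only chains of comparable length go on to the superimposition step.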

Dealing with chain ids

Biopython model objects cannot contain two chains with the same id. To avoid this, we give new ids to the input chains so they can be handled without errors during the program. We use numerical ids to cope with large numbers of chains. This solves the "same id" issue when working with objects, but to save the model in PDB format the chain id must be a single character.
When a chain is added to the current model, its id is changed again. At this point, the new id is taken from a list of ASCII characters (ascii_list) located in the utilities.py script. This last change of id allows the created model to be saved in PDB format, but the ASCII character list is limited: if a macrocomplex is formed by more than 83 chains, we have to create a new model to continue adding chains. To sum up, if the macrocomplex has up to 83 chains, it is created as one model and saved in a single file; otherwise, it is created as more than one model and saved split across different files. Besides avoiding Biopython errors during the program, having two or more files for one big structure avoids issues in Chimera when labeling chains.
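
The id scheme can be sketched as follows. The 83-character alphabet built here is hypothetical (the actual ascii_list in utilities.py may contain different characters), and assign_chain_ids is an illustrative name: the sketch only shows how numerically labeled chains could be split into models of at most 83 chains, each chain receiving a one-character id.

```python
import string

# Hypothetical 83-character alphabet; the real ascii_list may differ.
ASCII_LIST = list((string.ascii_letters + string.digits + string.punctuation)[:83])


def assign_chain_ids(n_chains, alphabet=ASCII_LIST):
    """Split chains numbered 0..n_chains-1 into models and give each a one-character id.

    Returns a list with one dict per model, mapping numeric id -> character id.
    """
    models = []
    for start in range(0, n_chains, len(alphabet)):
        block = range(start, min(start + len(alphabet), n_chains))
        models.append({num: alphabet[num - start] for num in block})
    return models


if __name__ == "__main__":
    # A 180-chain complex needs three models (83 + 83 + 14 chains).
    print([len(model) for model in assign_chain_ids(180)])
```

Ids are only unique within a model, which is why each model has to be saved in its own file.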


Analysis of some of the reconstructed macrocomplexes

Hemoglobin (1gzx)

1gzx Oxy T state haemoglobin: oxygen bound at all four haems (https://www.rcsb.org/structure/1gzx)

Hemoglobin is an iron-containing oxygen-transport metalloprotein. It forms a heterotetramer composed of 2 α (A, C) and 2 β (B, D) chains. This entry corresponds to a Homo sapiens oxy T state hemoglobin, so we find four oxygens bound to the heme groups.

Through our sequence similarity analysis based on pairwise alignment, we confirmed its stoichiometry and determined 2 unique chains. Additionally, the structural similarity analysis yielded a non-redundant interaction set of two pairs out of an initial set of four pairs (obtained by extracting all possible interactions under the distance restrictions).

Nucleosome (3kuy)

The nucleosome is the basic unit of DNA packaging in eukaryotes, consisting of a segment of DNA wound around a histone hetero-octamer. The structure is formed by two copies of each of the core histones H2A (C, G), H2B (D, H), H3 (A, E) and H4 (B, F) and two DNA chains (I, J). We analyzed this structure in two ways: one using the histone octamer and another one using the whole complex.

Through our sequence similarity analysis based on pairwise alignment, we confirmed its stoichiometry and 5 unique chains were determined, one of them are protein sequences and the other two are DNA sequences. Additionally, the structural similarity analysis yielded a non-redundant interaction set of 16 pairs out of an initial set of 24 pairs (obtained by extracting all possible interactions under the distance restrictions).

Ribosome (4v4a)

The 4v4a entry corresponds to an Escherichia coli 70S ribosome. It consists of a hetero 49-mer formed by two subunits: 50S (large) and 30S (small). 50S is composed of 30 protein chains and two rRNAs (5S and 23S), while 30S is formed by 19 protein chains and one rRNA (16S).

Through our sequence similarity analysis based on pairwise alignment, we confirmed its stoichiometry and 20 unique chains were determined, all of them are protein sequences and the other is RNA. Additionally, the structural similarity analysis yielded a non-redundant interaction set of 27 pairs out of an initial set of 27 pairs (obtained by extracting all possible interactions under the distance restrictions).

An important trait of this structure is that the RNA molecule interacts with almost all the protein chains, so to reconstruct the model we had to reduce the clash cut-off distance to 0.5 Å (the argument needs to be set to -cc 0.5).

Phosphatase (2f1d)

The 2f1d entry corresponds to an Arabidopsis thaliana imidazoleglycerol-phosphate dehydratase, an enzyme of histidine biosynthesis. The structure is composed of 24 identical subunits, which form a dimanganese cluster crucial for its activation.

Through our sequence similarity analysis based on pairwise alignment, we confirmed its stoichiometry and determined one unique chain. Additionally, the structural similarity analysis yielded a non-redundant interaction set of 3 pairs out of an initial set of 26 pairs (obtained by extracting all possible interactions under the distance restrictions).

An important trait of this structure is that all the protein chains are closely positioned, so to reconstruct the model we had to reduce the clash cut-off distance to 0.5 Å (the argument needs to be set to -cc 0.5).

In this image, we can see that the structure from the PDB file (2f1d) is divided into two structures, while in our model it is completely reconstructed.

Proteasome (1g65)

The 1g65 PDB entry corresponds to a Saccharomyces cerevisiae 20S proteasome core particle interacting with epoxomicin, an inhibitor. Its function is to degrade unneeded or damaged proteins by proteolysis. Its structure is a hetero 28-mer, formed by 14 pairs of components (Y7, Y13, PRE2, PRE3, PRE4, PRE5, PRE6, PUP1, PUP2, PUP3, C1, C5, C7α, C11) and two chains of epoxomicin.

Through our sequence similarity analysis based on pairwise alignment, we confirmed its stoichiometry and determined 15 unique chains. Additionally, the structural similarity analysis yielded a non-redundant interaction set of 36 pairs out of an initial set of 70 pairs (obtained by extracting all possible interactions under the distance restrictions).

An important trait of this structure is that all the protein chains are closely positioned, so to reconstruct the model we had to reduce the clash cut-off distance to 0.5 Å (the argument needs to be set to -cc 0.5).

Energy analysis comparison (Modeller): comparison between the model and the refined model

After the optimization and the DOPE assessment performed by the MODELLER functions, we obtain a DOPE score plot in which we can compare the energy profile of the model before and after the optimization. In this plot, we don’t see big changes between both models, but the regions slightly changed in the optimized model have lower DOPE scores. Changes in the optimized model are slight because the restraints applied are subtle, and this is reflected in the DOPE profile plot.

The quality of the DOPE plots varies depending on the structures optimized and analyzed. In models that include peptide chains and nucleic acids, we observe profiles with plateaus of value 0 at the positions corresponding to the nucleic acid. To overcome this error, our program removes the nucleic acid atoms from the optimized and the non-optimized model files. However, the DOPE profiles of both models seem to be shifted, so at the moment we can’t rely on them. One of the future goals of the project is to improve the overall optimization process by combining MODELLER with other available software.

An example of this analysis is shown here with the hemoglobin reconstruction from 1gzx pdb file:

Limitations

After analyzing some examples, we realized that our program has some limitations when dealing with macrocomplexes composed of more subunits, such as the ATPase (5vox) or the full virus map of brome mosaic virus (3j7l).

ATPase (5vox)

V-ATPases are found in the eukaryotic endomembrane system and hydrolyze ATP to drive a proton pump. V-ATPase is a hetero 33-mer composed of two domains, V0 and V1, which are composed of 5 and 8 subunits, respectively. When trying to reconstruct this macrocomplex, some of the chains that only interact with one of a set of equal chains were wrongly placed. The placement depended on the input pair taken as the reference to start building the model. Moreover, the clash distance cut-off also has to be lowered to 0.5.

Brome mosaic virus (3j7l)

Brome mosaic virus is a small, positive-stranded, icosahedral RNA plant virus formed by 3 quasi-equivalent subunits: A, which forms pentameric structures, and B and C, which compose hexameric capsomeres. In this case, as it has a total of 180 subunits and the number of characters available to assign as chain ids is limited to 83, we had to build several models and check the clashes of each newly added chain against the previously created models. In the end, we were almost able to reconstruct it, but it takes a long time.


Authors
