neksa/Descriptor_Calculator

 
 


Descriptor of Elementary Function

Workflow

Preprocessing

Unless otherwise stated, functions come from src/preprocessing/preprocess.py.

  1. Extract seq-name and chain-ID from source extracts

    • Input:
      1. Website extract files (e.g. data/user/input/prosite_extract.txt)
    • Output:
      1. data/internal/pname_cid_map.pkl
    • Description:
      1. Place the prosite or ioncom extract in /data/user/input/.
      2. Run parse_extracts(source, filename) in src/preprocessing/preprocess.py, specifying the source (ioncom or prosite) and the name of the extract file. This extracts the sequence names and chain-IDs to be processed.
  2. (Optional) Download relevant .pdb files from the RCSB server

    • Input:
      1. data/internal/pname_cid_map.pkl
      2. Internet connection
    • Output:
      1. Populated data/internal/pdb_files/
    • Description:
      Downloads the corresponding .pdb files from the RCSB server, then deletes entries in pname_cid_map whose .pdb files are not in the folder.
      1. Run download_pdb()
      2. Run trim_pnames_based_on_pdb()
  3. Create sequence .fasta file

    • Input:
      1. data/internal/pname_cid_map.pkl
    • Output:
      1. data/internal/seqs.fasta
    • Description:
      The motif-finding binaries require the sequences to be in a .fasta file.
      1. Run create_seq()
  4. Filter short sequences

    • Input:
      1. data/internal/seqs.fasta
      2. Populated data/internal/pdb_files/
    • Output:
      1. Updated data/internal/seqs.fasta
    • Description:
      Sequences shorter than the desired motif length (30 residues) can cause errors during the motif search, so they are dropped.
      1. Run filter_seq_file()
  5. (Optional) Create seed sequence file for converge

    • Input:
      1. data/user/input/ioncom_binding_sites.txt
    • Output:
      1. data/internal/seed_seqs.fasta
    • Description:
      The motif-finding binary converge requires seed sequences from which it generates its initial set of motifs.
      1. Place the ioncom binding-site file in /data/user/input/.
      2. Run make() in src/preprocessing/make_conv_seed_seqs.py.
  6. Run motif-search binary to find motif positions

    • Input:

      1. data/internal/seqs.fasta
      2. (Optional) Populated data/internal/pdb_files/
      3. (Optional) Provided motif file (e.g. data/user/input/meme.txt)
      4. (Optional) data/internal/seed_seqs.fasta
    • Output:

      1. data/internal/motif_pos.pkl
    • Description:
      This finds the positions of the desired motif for each sequence-chain. There are three implemented ways of running this locally:

      1. Motifs can be derived from scratch, using meme. This generates both the motif file and the motif positions. Run find(process='meme', num_p=<num_processors>) in src/preprocessing/motif_finder.py.
      2. Motifs can be found using a given motif file. First, put the motif file (in MEME format) in data/user/input/<filename>. Then, run find(process='mast', motif_fname=<filename>, num_p=<num_processors>) in src/preprocessing/motif_finder.py.
      3. Motifs can be derived from scratch using converge, which also provides the motif file and positions. Run make(input_fname=<filename>, num_p=<num_processors>) in src/preprocessing/make_conv_seed_seqs.py.

      Because the motif-finding process has a long run-time, it is recommended to run this step on a server. Instructions for doing so are in [1] below.
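The short-sequence filter in step 4 can be illustrated with a minimal, standalone sketch. This is not the repository's filter_seq_file(); the helper names parse_fasta, drop_short and format_fasta are hypothetical, and the only detail taken from the text above is the 30-residue threshold.

```python
MIN_LEN = 30  # motif length quoted in step 4


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_short(records, min_len=MIN_LEN):
    """Keep only records whose sequence has at least min_len residues."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def format_fasta(records):
    """Serialise records back to FASTA text, one sequence per line."""
    return "".join(f">{h}\n{s}\n" for h, s in records)
```

To apply it, read data/internal/seqs.fasta as text, pass the result through parse_fasta and drop_short, and write format_fasta(...) back to the same path. Sequences of exactly 30 residues are kept, since only those shorter than the motif length are described as problematic.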

Descriptor Generation

  1. Calculate descriptor properties

    • Input:
      1. data/internal/motif_pos.pkl
      2. Populated data/internal/pdb_files/
    • Output:
      1. data/internal/descrs.pkl
    • Description:
      This calculates the descriptor properties for each motif. Run calculate() in src/descr/descr_main.py.
  2. Visualise properties

    • Input:
      1. data/internal/descrs.pkl
    • Output:
      1. (Optional) data/user/output/
    • Description:
      Plots for different descriptor properties can be generated via src/utils/plots.py. Run each plot_<something>(save=False) as needed, and set save=True to keep the generated plots in the output folder.
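Before plotting, it can help to check what data/internal/descrs.pkl actually contains. The sketch below assumes nothing about the pickle's layout, which this README does not document; load_pickle and describe are hypothetical helpers, not functions from this repository.

```python
import pickle


def load_pickle(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe(obj):
    """Return a (type name, size) summary without assuming a layout.

    size is None for objects that do not define a length.
    """
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

For example, describe(load_pickle("data/internal/descrs.pkl")) reports the container type and its length. If the file holds objects from a third-party library, that library must be importable when the file is unpickled.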

Tests

  • Generate Reference Output

    • /tests/src/setup_ref.py
  • Visualise Reference Output

    • /tests/src/plot_ref.py
  • Check Against Reference Output

    • /tests/src/test_motif_finder.py
    • /tests/src/test_descr_main.py

Data files

  • /data
    • /tmp: Created at runtime and deleted at the end of a run, except when debugging. It is not deleted for tests that fail.
    • /input
      • /ioncom
        • allsulfate.txt: Raw sequence-binding_site match, for mg, in dataIonCom.zip, downloaded from https://zhanglab.ccmb.med.umich.edu/IonCom/ >> download dataset used to...
        • ioncom.txt: allid_reso3.0_len50_nr40.txt from dataIonCom, which lists the sequences (to be deprecated eventually).
      • /mg_full
        • mg_50.fasta: From UniProt, UniRef50 for sequences with MG as cofactor/ligand.
        • mg_100.fasta: UniRef100 for MG cofactor sequences.
      • /pdb_files: Stored .pdb files. Both tests and main should use this folder, since downloading takes a while. Files are downloaded automatically from the RCSB server via https://files.rcsb.org/view/{1ABC}.pdb
      • /prosite
      • /internal
        • fasta_template.fasta: Used for running mast when we only want the seqlogo and do not actually care about matching motifs.
        • meme.txt: Motif file for Calcium EF-hand.
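Because /pdb_files is meant to be reused rather than re-downloaded, a cache-first fetch is a natural pattern. The sketch below is an illustration, not the repository's download_pdb(): pdb_url and fetch_pdb are hypothetical names, and upper-casing the ID and naming the file <ID>.pdb are assumptions. Only the URL template comes from the /pdb_files entry above.

```python
import os
import urllib.request

PDB_URL = "https://files.rcsb.org/view/{}.pdb"  # template quoted in this README


def pdb_url(pdb_id):
    """Build the download URL; upper-casing the ID is an assumption."""
    return PDB_URL.format(pdb_id.upper())


def fetch_pdb(pdb_id, folder, download=urllib.request.urlretrieve):
    """Download <pdb_id>.pdb into folder unless it is already there.

    Returns (path, downloaded); downloaded is False on a cache hit.
    The download argument can be swapped for a stub when offline.
    """
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{pdb_id.upper()}.pdb")
    if os.path.exists(path):
        return path, False
    download(pdb_url(pdb_id), path)
    return path, True
```

Passing the downloader as a parameter keeps the caching logic testable without a network connection, which also suits the note that tests and main should share the same folder.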

Linter

  • pylint, mostly following the Google style guide, with some additional checks disabled.

TODO:

  1. pdb_list from prosite needs to be extracted too...?

