MetamORF: A repository of unique short Open Reading Frames identified by both experimental and computational approaches for gene-level and meta analysis
This GitHub repository contains the instructions and material to reproduce the build of the MetamORF database (web interface available at http://metamorf.hb.univ-amu.fr). Extensive documentation, source code, scripts and containers are available in this repository. Builded Docker/Singularity images are available on download and required data may be downloaded from their original sources. Instructions necessary to reproduce the analysis are provided below.
To build the database, you first need to prepare the environments and then follow the steps described there.
In order to reproduce the build of MetamORF, it is first necessary to download data from the 6 original data sources. Data have to be downloaded manually from the editors' or authors' websites.
Name of the data source | Species | Associated publication | doi | URL to the publication | Description |
---|---|---|---|---|---|
Erhard2018 | H.sapiens | Erhard et al., Nat. Meth., 2018 | 10.1038/nmeth.4631 | https://www.nature.com/articles/nmeth.4631 | "Supplementary Table 3: Identified ORFs (Union of all ORFs detected either by PRICE,RP-BP or ORF-RATER, or contained in the annotation (Ensembl V75))". The two first lines of the file have to be manually removed. |
Johnstone2016 | H.sapiens | Johnstone et al., EMBO, 2016 | 10.15252/embj.201592759 | http://emboj.embopress.org/content/35/7/706.long | "Dataset EV2: Location and translation data for all analyzed transcripts and ORFs in human" |
Johnstone2016 | M.musculus | Johnstone et al., EMBO, 2016 | 10.15252/embj.201592759 | http://emboj.embopress.org/content/35/7/706.long | "Dataset EV3: Location and translation data for all analyzed transcripts and ORFs in mouse" |
Laumont2016 | H.sapiens | Laumont et al., Nat. Commun., 2016 | 10.1038/ncomms10238 | https://www.nature.com/articles/ncomms10238 | "Supplementary Data 2: List of all cryptic MAPs detected in subject 1. Table presenting the genomic and proteomic features of all cryptic MAPs". The two first rows have to be manually removed. |
Mackowiak2015 | H.sapiens | Mackowiak et al., Genome Biol., 2015 | 10.1186/s13059-015-0742-x | https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0742-x | "Additional file 2: Table S1. All sORF information for human". The file header has to be removed manually (45 first rows). |
Mackowiak2015 | M.musculus | Mackowiak et al., Genome Biol., 2015 | 10.1186/s13059-015-0742-x | https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0742-x | "Additional file 3: Table S2. All sORF information for mouse". The file header has to be removed manually (45 first rows). |
Samandi2017 | H.sapiens | Samandi et al., eLIFE, 2017 | 10.7554/eLife.27860 | https://elifesciences.org/articles/27860 | "Homo sapiens alternative protein predictions based on RefSeq GRCh38 (hg38) based on assembly GCF_000001405.26. Release date 01/01/2016". The TSV file has been used. |
Samandi2017 | M.musculus | Samandi et al., eLIFE, 2017 | 10.7554/eLife.27860 | https://elifesciences.org/articles/27860 | "Mus musculus alternative protein predictions based on annotation version GRCm38. Release date 01/01/2016". The TSV file has been used. |
sORFs_org_Human | H.sapiens | Olexiouk et al., Nucl. Ac. Res., 2018 | 10.1093/nar/gkx1130 | https://academic.oup.com/nar/article/46/D1/D497/4621340 | H. sapiens database downloaded from sORFs.org using the Biomart Graphic User Interface. The following parameters were used to query the database: "Homo sapiens" > "no filters" > "select all MAIN_ATTRIBUTES" > "results" > "download data". |
sORFs_org_Mouse | H.sapiens | Olexiouk et al., Nucl. Ac. Res., 2018 | 10.1093/nar/gkx1130 | https://academic.oup.com/nar/article/46/D1/D497/4621340 | M. musculus database downloaded from sORFs.org using the Biomart Graphic User Interface. The following parameters were used to query the database: "Mus musculus" > "no filters" > "select all MAIN_ATTRIBUTES" > "results" > "download data". |
The source code must be executed with Python 2.7. The MetamORF database has been build on a Linux system using Docker and Singularity containers, but it could be run on any operating system where dependencies are satisfied. A minimum of 62 GB of RAM memory, 12 threads is highly recommended to run the software but we highly recommend to use systems with at least 40 threads, 192 GB of RAM available if you intend to build the full database. A stable connection to the Internet is also required as some information are queried from different databases accessible online.
In order to prepare the environment for analysis execution, it is required to:
- Clone the current GitHub repository and set the
WORKING_DIR
environment variable - Download the data sources and the cross-references
- Download the Docker image (.tar.gz) and Singularity image (.img) files
- Install Docker, [Docker-compose](https://docs.docker.com/compose/ and Singularity
- Load the Docker images on your system and start the containers
This section provides additional information for each of these steps.
Use you favorite method to clone this repository in a chosen folder (see GitHub documentation for more information). This will create a folder called MetamORF
containing all the code and documentation.
Then, set an environment variable called WORKING_DIR
with a value set to the path to this folder. For instance, if you cloned the repository in /home/choteaus/workspace
, then the WORKING_DIR
variable needs be set to /home/choteaus/workspace/MetamORF
.
On Linux:
export WORKING_DIR=/home/choteaus/workspace/MetamORF
Cross-references
Cross-references have been downloaded manually from the HGNC and NCBI websites. Cross-references from the HUGO Gene Nomenclature Committee (HGNC) have been downloaded using the graphic user interface. Cross-references for M.musculus from the National Center for Biotechnology Information (NCBI) may be download using the following command line (on Linux):
mkdir -p $WORKING_DIR/07_input/cross_references \
&& curl -o $WORKING_DIR/07_input/cross_references/mmusculus.gene_info.gz \
-O ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Mus_musculus.gene_info.gz \
&& gunzip $WORKING_DIR/07_input/cross_references/mmusculus.gene_info.gz
As these databases evolve quickly, a copy of the cross-reference files we used to build MetamORF is available on Zenodo () as a .tar.gz archive. On Linux, run the following command line to download the archive and uncompress it at the appropriate location:
wget https://zenodo.org/record/4014738/files/MetamORF_07_input.tar.gz?download=1 -O $WORKING_DIR/MetamORF_07_input.tar.gz \
&& tar xzvf $WORKING_DIR/MetamORF_07_input.tar.gz --directory $WORKING_DIR \
&& rm $WORKING_DIR/MetamORF_07_input.tar.gz
Name of the cross-reference | Species | Filename |
---|---|---|
HGNC | H.sapiens | hsapiens_HGNC.txt |
NCBI | M.musculus | mmusculus.gene_info |
Data sources
Raw data need to be downloaded manually from editors' or authors' website using information provided in the Description of the datasets section of the current documentation. Once downloaded, the files need to be saved in the $WORKING_DIR/07_input/ORF_datasources
folder and renamed using the following rules:
Name of the data source | Species | Filename |
---|---|---|
Erhard2018 | H.sapiens | hsapiens_Erhard2018.csv |
Johnstone2016 | H.sapiens | hsapiens_Johnstone2016.txt |
Johnstone2016 | M.musculus | mmusculus_Johnstone2016.txt |
Laumont2016 | H.sapiens | hsapiens_Laumont2016.csv |
Mackowiak2015 | H.sapiens | hsapiens_Mackowiak2015.txt |
Mackowiak2015 | M.musculus | mmusculus_Mackowiak2015.txt |
Samandi2017 | H.sapiens | hsapiens_Samandi2017.tsv |
Samandi2017 | M.musculus | mmusculus_Samandi2017.tsv |
sORFs_org_Human | H.sapiens | hsapiens_sORFs.org.txt |
sORFs_org_Mouse | M.musculus | mmusculus_sORFs.org.txt |
Once done, you should obtain the following subfolder structure:
07_input
βββcross_references
β βββ hsapiens_HGNC.txt
β βββ mmusculus.gene_info
βββ ORF_datasources
βββ hsapiens_Erhard2018.csv
βββ hsapiens_Johnstone2016.txt
βββ mmusculus_Johnstone2016.txt
βββ hsapiens_Laumont2016.csv
βββ hsapiens_Mackowiak2015.txt
βββ mmusculus_Mackowiak2015.txt
βββ hsapiens_Samandi2017.tsv
βββ mmusculus_Samandi2017.tsv
βββ hsapiens_sORFs.org.txt
βββ mmusculus_sORFs.org.txt
Docker image tar file and Singularity img files are available to download on Zenodo ().
If you intend to build yourself the containers, the dockerfile and their documentation are available in the 02_container
folder. Note that some links of the dockerfile may be obsolete and we strongly suggest to use the containers stored on Zenodo.
To download the containers from Zenodo, open a shell and execute the following commands to download the tarball file and untar it (on Linux):
rm -R $WORKING_DIR/02_container \
&& wget https://zenodo.org/record/4014738/files/MetamORF_02_container.tar.gz?download=1 -O $WORKING_DIR/MetamORF_02_container.tar.gz \
&& tar xzvf $WORKING_DIR/MetamORF_02_container.tar.gz --directory $WORKING_DIR \
&& rm $WORKING_DIR/MetamORF_02_container.tar.gz
These commands will replace the 02_container
folder and you should obtain the following subfolder structure:
02_container/
βββ mysql
βΒ Β βββ docker-compose.yml
βΒ Β βββ readme.txt
βββ script
βΒ Β βββ 01_crossreferences_download
βΒ Β βΒ Β βββ dockerfile
βΒ Β βΒ Β βββ readme.txt
βΒ Β βββ 03_orf_datasources_analysis
βΒ Β βββ dockerfile
βΒ Β βββ readme.txt
βββ src
βββ dockerfile
βββ readme_dev.txt
βββ readme.txt
βββ tagc-uorf-orf-datafreeze-src.img
You need to install Docker (v18.09), Docker-compose (v1.24) and Singularity (v2.5) on your system. Please, read their official documentation for more information.
-
To install Docker, follow the instructions here: https://docs.docker.com/engine/install/
-
To install Docker-compose, follow the instructions here: https://docs.docker.com/compose/install/
-
To install Singularity v2.6, follow the instructions here: https://sylabs.io/guides/2.5/admin-guide/
In order to build the MetamORF database, you must load the provided docker images onto your Docker. Docker must be installed on your system.
- Update the line 12 and 13 of the
02_container/mysql/docker-compose.yml
file. The~/MetamORF/data/MySQL
path needs to be changed by the place where you are willing to store the MySQL files. The lines 11 to 13 can eventually be removed if you do not intend to mount a volume on the container. - Eventually update the line 15 to 17 to update the MySQL user name (15), user (16) or root (17) password.
- Eventually update the ports used by the server. If you do not know which ports to use, we strongly suggest to keep the one defined there.
Load the MySQL Docker using one of the following command (on Linux):
docker load -i $WORKING_DIR/02_container/mysql/mysql:8.0.16.tar.gz
or
docker pull mysql:8.0.16
Load the Adminer Docker one of the following command (on Linux):
docker load -i $WORKING_DIR/02_container/mysql/adminer:4.7.17.tar.gz
or
docker pull adminer:4.7.17
Start the Docker using the command
cd $WORKING_DIR/02_container/mysql \
&& docker-compose up -d
Please, note that Adminer is not necessary to build the database and is only provided in order to allow the use and managment of the database through an user-friendly web interface. Then, more recent version of Adminer or other services (such as phpMyAdmin) may be used instead. You may also wish to run solely the MySQL Docker, without Adminer.
To run the source code, we advice to use the Singularity image, which do not require the root privileges to be used. Nevertheless, if you are willing to use Docker instead of Singularity, archives necessary to do so are available. In such case, you need to load the Docker image and create a new container. Please, read the documentation associated with the containers and the Docker official documentation for more information.
Other containers provided in this repository are not necessary to run the source code and to build the database. They have been created to allow the download of cross-references (from NCBI, Ensembl and MGI) and to perform an estimate of the proportion of short Open Reading Frames that are spliced in the each data source. Please, read their documentation for more information.
If you do not wish to use Docker and Singularity images, then you need to ensure the following dependencies are successfully installed on your system:
- Python 2.7, with packages:
- SQLAlchemy
- Pandas
- PyEnsembl
- PyBiomart
- PyLiftOver
- wget
- statistics
- BioPython
- mysql-connector-python
- pathos
- R, with packages:
- getopt
- devtools
- Bioconductor: ensembldb, AnnotationHub
- MySQL
- SQLite3
- MUSCLE (Multiple sequence alignment software)
- UCSC utils
- fetchChromSizes
- bedToBigBed
Please note that we highly recommend to use the Docker and Singularity images we provide in order to ensure the reproducibility of the results. Please see the official documentation of these sotwares and packages for more information regarding their installation.
An extensive documentation providing information about the tools provided by the MetamORF builder are available in the manual. This section provides the most important and some complementary information to build the MetamORF database. The source code needs to be run once for each species and will result in the creation of two databases (DS and PRO) for each.
The config file needs to be edited manually. Example of config files are available in the 04_config
folder.
The following lines need to be checked / updated:
-
In the
DATABASE
section:- The name of the DS database:
DS_DATABASE_NAME
. - The name of the PRO database:
PRO_DATABASE_NAME
. - The name of the species:
DATABASE_SPECIES
(Hsapiens
andMmusculus
allowed). - The IP and the port of the MySQL host:
DATABASE_HOST_IP
andDATABASE_PORT
(NB: SQLite database may also be used, see the manual for more information). - The database username and password:
DATABASE_USER_NAME
andDATABASE_USER_PASSWD
.
- The name of the DS database:
-
In the
GENE_LIST
andDATASOURCE
sections, all the occurrences of$WORKING_DIR
must be replaced by its actual value (i.e. by the absolute path to the working directory).
Numerous options may be set in the config file. Please read the manual for more information regarding this topic. If you do not know how to configure the config file, we advice to use this default setting.
The MetamORF H.sapiens and M.musculus databases have been build by sequentially running several strategies. These strategies have been run in the following order:
- DatabaseCheck
- Insertion
- LiftOver
- Merge
- ComputeMissingInfo
- ComputeRelCoord
- ComputeKozakContext
- AnnotateORF
- GenerateBEDContent
If you intend to run the strategies in a different order or to run other strategies, update the 03_workflow/datafreeze/full_build.sh
or create a new bash script in this folder. The 03_workflow/datafreeze/model.sh
may be used as a template to create a new script. Several other scripts are available in this folder if you are willing to run a particular strategy.
To get more information about the strategies available, their execution, the options that may be used, input and outputs, please read the manual available at pdf and html formats in 01_documentation/manual
. If you do not know which strategies to run, we advice to use the full_build.sh
script.
To build the database, move to the working directory and start the running script previously created.
You may used one of the following command line (on Linux):
singularity exec 02_container/src/tagc-uorf-orf-datafreeze-src.img \
03_workflow/datafreeze/full_build_min.sh \
--config $configfileName \
> 09_log/MetamORF.log
or
singularity exec 02_container/src/tagc-uorf-orf-datafreeze-src.img \
03_workflow/datafreeze/full_build.sh \
--config $configfileName \
--dbtype MySQL \
--dsdbname $DS_DB \
--prodbname $PRO_DB \
--dbhost $DB_HOST \
--dbport $DB_PORT \
--dbuser $DB_USER \
--dbpassword $DB_PASSWD
> 09_log/MetamORF.log
The variable $configfileName
needs to be replaced by the name of the config file.
The variables $DS_DB
, $PRO_DB
, $DB_HOST
, $DB_PORT
, $DB_PORT
, $DB_USER
and $DB_PASSWD
need to be replaced by the appropriate values. Note that the options to provide may depend on the strategy. This second command line allows to add information about the release in the Metadata table of the databases.
The same procedure should be followed for both H.sapiens and M.musculus, using the appropriate configuration files.
An user manual is available at the PDF and HTML format in 01_documentation/manual
and provides extensive information about the methods.
A documentation dedicated to developers is available in the 01_documentation/src/html
folder (generated with Doxygen). You need to open the index.html
file with a web browser to display and navigate through this documentation.
Data sources
Name of the data source | Species | Date of download |
---|---|---|
Erhard2018 | H.sapiens | 04/01/2019 |
Johnstone2016 | H.sapiens | 04/01/2019 |
Johnstone2016 | M.musculus | 04/01/2019 |
Laumont2016 | H.sapiens | 04/01/2019 |
Mackowiak2015 | H.sapiens | 20/03/2019 |
Mackowiak2015 | M.musculus | 20/03/2019 |
Samandi2017 | H.sapiens | 04/01/2019 |
Samandi2017 | M.musculus | 04/01/2019 |
sORFs_org_Human | H.sapiens | 08/06/2020 |
sORFs_org_Mouse | M.musculus | 08/06/2020 |
Cross-references
Name of the cross-reference | Species | Date of download |
---|---|---|
HGNC | H.sapiens | 27/06/2019 |
NCBI | M.musculus | 06/06/2020 |
At the end of the procedure, i.e. after having successfully:
- Clone the GitHub repository
- Download the data sources and cross-references
- Download the Dockers and Singularity images
- Install Docker, Docker-compose and Singularity
- Load the Docker images on your system and start the containers
- Run the source code to build the databases for one species (either for H.sapiens or M.musculus)
the tree folder should look like the following one:
.
β
βββ 01_documentation [Documentation of the project]
βΒ Β βββ manual [User's manual]
βΒ Β βΒ Β βββ manual.html
βΒ Β βΒ Β βββ manual.pdf
βΒ Β βββ metadata [Metadata of the ORF datafreeze]
βΒ Β βΒ Β βββ DCMI_schema.xsd
βΒ Β βΒ Β βββ metadata.xml
βΒ Β βββ src [Source code documentation, generated with doxygen in HTML format]
βΒ Β βΒ Β βββ html
βΒ Β βββ workbook [Useful information regarding the source code]
βΒ Β βββ appendices [Workbook appendices helping dev and use of the source code]
βΒ Β βΒ Β βββ cell_contexts [Information regarding the cell contexts registered in the database]
βΒ Β βΒ Β βΒ Β βββ cell_context.csv
βΒ Β βΒ Β βββ db_schema [Database UML schemas]
βΒ Β βΒ Β βΒ Β βββ database_schema.pdf
βΒ Β βΒ Β βββ log_codes [List of error and warning codes that could be logged by the source code]
βΒ Β βΒ Β βΒ Β βββ LogCodes.ods
βΒ Β βΒ Β βββ miscellaneous [Miscellaneous information susceptible to help the users and developers]
βΒ Β βΒ Β βββ Ensembl_biotypes.csv
βΒ Β βββ computation_times.csv [Information regarding the expected time of execution of the source code]
βΒ Β βββ datafreeze_workflow.png [Datafreeze adviced workflow at png format]
βββ 02_container [Dockerfiles and singularity images]
βΒ Β βββ mysql [Docker compose to start mysql and adminer servers]
βΒ Β βΒ Β βββ adminer:4.7.1.tar.gz
βΒ Β βΒ Β βββ docker-compose.yml
βΒ Β βΒ Β βββ mysql:8.0.16.tar.gz
βΒ Β βΒ Β βββ readme.txt
βΒ Β βββ script [Containers for scripts]
βΒ Β βΒ Β βββ 01_crossreferences_download [Containers for 01_crossreferences_download scripts]
βΒ Β βΒ Β βΒ Β βββ dockerfile
βΒ Β βΒ Β βΒ Β βββ readme.txt
βΒ Β βΒ Β βΒ Β βββ tagc-uorf-orf-datafreeze-script-cross_ref_dl.tar.gz
βΒ Β βΒ Β βββ 03_orf_datasources_analysis [Containers for 03_orf_datasources_analysis scripts]
βΒ Β βΒ Β βββ dockerfile
βΒ Β βΒ Β βββ readme.txt
βΒ Β βΒ Β βββ tagc-uorf-orf-datafreeze-script-orf_ds_analysis.tar.gz
βΒ Β βββ src [Container to run the source code]
βΒ Β βββ dockerfile
βΒ Β βββ readme.txt
βΒ Β βββ tagc-uorf-orf-datafreeze-src.img
βΒ Β βββ tagc-uorf-orf-datafreeze-src.tar.gz
βββ 03_workflow [Scripts and files necessary to run the source code]
βΒ Β βββ datafreeze [Scripts allowing to build the datafreeze]
βΒ Β βββ declare_variables.sh [Script allowing to declare environment variables necessary to run the source code]
βΒ Β βββ model.sh [Model of script to use to run one or several strategies]
βΒ Β βββ AnnotateORF.sh
βΒ Β βββ AssessDatabaseContent.sh
βΒ Β βββ ComputeKozakContext.sh
βΒ Β βββ ComputeMissingInfo.sh
βΒ Β βββ ComputeRelCoord.sh
βΒ Β βββ DatabaseCheck.sh
βΒ Β βββ full_build_min.sh
βΒ Β βββ full_build.sh
βΒ Β βββ GenerateBEDContent.sh
βΒ Β βββ GenerateBEDFile.sh
βΒ Β βββ GenerateFastaFile.sh
βΒ Β βββ GenerateGFFFile.sh
βΒ Β βββ GenerateStatFiles.sh
βΒ Β βββ GenerateTrackDbFile.sh
βΒ Β βββ Insertion.sh
βΒ Β βββ LiftOver.sh
βΒ Β βββ Merge.sh
βΒ Β βββ ResumeMerge.sh
βββ 04_config [Config files necessary to run the source code]
βΒ Β βββ HsapiensConfigfile
βΒ Β βββ MmusculusConfigfile
βββ 05_script [Scripts for pre-processing analysis]
βΒ Β βββ 01_crossreferences_download [Scripts allowing to download the cross-references]
βΒ Β βΒ Β βββ DefaultOutputFolder.txt
βΒ Β βΒ Β βββ download.sh
βΒ Β βΒ Β βββ ensembl_gene_lists.R
βΒ Β βΒ Β βββ readme.txt
βΒ Β βββ 02_orf_datasources_download [Scripts allowing to download the ORF datasources - Empty]
βΒ Β βββ 03_orf_datasources_analysis [Scripts allowing a preliminary analysis of ORF datasources]
βΒ Β βββ estimate_splicing
βΒ Β βββ readme.txt
βββ 06_src [See doxygen documentation for full documentation of this folder content]
βββ 07_input [Input files]
β βββ cross_references [Cross references]
β βΒ Β βββ hsapiens_HGNC.txt
β βΒ Β βββ mmusculus.gene_info
β βββ ORF_datasources [ORF datasources]
β Β Β βββ hsapiens_Erhard2018.csv
β Β Β βββ hsapiens_Johnstone2016.txt
β Β Β βββ hsapiens_Laumont2016.csv
β Β Β βββ hsapiens_Mackowiak2015.txt
β Β Β βββ hsapiens_Samandi2017.tsv
β Β Β βββ hsapiens_sORFs.org.txt
β Β Β βββ mmusculus_Johnstone2016.txt
β Β Β βββ mmusculus_Mackowiak2015.txt
β Β Β βββ mmusculus_Samandi2017.tsv
β Β Β βββ mmusculus_sORFs.org.txt
βββ 08_output [Output files]
β βββ datafreeze
β βββ execution.log
β βββ generefwarnings.log
β βββ merged_data_analysis
β βΒ Β βββ dsota_count_for_orf_tr.csv
β βΒ Β βββ orf_tr_count_for_dsota.csv
β βββ content_consistency_assessment
β βΒ Β βββ DSDatabaseAssessment.tsv
β βΒ Β βββ PRODatabaseAssessment.tsv
β βββ bed_files
β βΒ Β βββ MetamORF_Hsapiens.bed
β βΒ Β βββ MetamORF_Hsapiens_without_scaffold.bed
β βββ fasta_files
β βΒ Β βββ MetamORF_Hsapiens_aa.fasta
β βΒ Β βββ MetamORF_Hsapiens_aa_wo_seq_with_stop.fasta
β βΒ Β βββ MetamORF_Hsapiens_nt.fasta
β βΒ Β βββ MetamORF_Hsapiens_nt_wo_seq_with_stop.fasta
β βββ track_files
β β βββ hg38.chrom.sizes
β β βββ MetamORF.as
β β βββ MetamORF.bb
β β βββ MetamORF.bed
β β βββ trackDb.txt
β βββ stat_files
β Β Β βββ log_code_counts.csv
β Β Β βββ log_level_counts.csv
βββ README.md
NB: Descriptions of the file / directory and additional information are provided between [brackets].