ISB-CGC-User-Data-Processor

This repository contains the user data processor. The code originated in the isb-cgc-data-proc repository but diverged enough to warrant its own repository.

The bigquery_etl module has been copied from the isb-cgc-data-proc repository.

This code is deployed on a Jenkins slave node and run by Jenkins.

### General process of the processor:

1. Read through the config file, pull out the relevant information, and separate all the user_gen data files from the rest of the molecular datatypes.
2. Process all user_gen files together as one.
   - 2.1 Download file from cloud storage.
   - 2.2 Get column mappings, renaming columns to the mapping provided.
   - 2.3 Insert data of each file into the metadata_data table for the study.
   - 2.4 Merge all files into one dataframe on SampleBarcode. NOTE: This assumes that all user_gen files provided will have a mapping to SampleBarcode.
   - 2.5 Insert the table into the user's metadata_samples table for the study.
   - 2.6 Create and update the BigQuery table by writing to a temporary file and uploading that file to BigQuery.
   - 2.7 Generate new feature definitions for each column in the metadata_samples table except SampleBarcode.
   - 2.8 Delete temporary file.
3. Process each molecular datatype file individually.
   - 3.1 Download file from cloud storage.
   - 3.2 Convert file to dataframe.
   - 3.3 Get column mappings that map the columns in the file to the correct columns in the BigQuery schema. NOTE: Each molecular file is to have this format:

     | Symbol | Feature ID | Tab | Sample ID 1 | Sample ID 2 | Sample ID 3 |
     | --- | --- | --- | --- | --- | --- |
     | BRCA | BRCA ID | Optional Information | Value | Value | Value |
     | EGFR | EGFR ID | Optional Information | Value | Value | Value |
     | TP53 | TP53 ID | Optional Information | Value | Value | Value |

   - 3.4 Convert the matrix into denormalized rows based on sample ID to store in BigQuery.
   - 3.5 Generate metadata_data rows from the samples in the file and insert them into the metadata_data table for the study.
   - 3.6 Update the metadata_samples table for samples that exist and insert new samples that don't exist.
   - 3.7 Generate new feature definitions for the datatype based on unique symbols.
   - 3.8 Create and update the BigQuery table by writing to a temporary file and uploading that file to BigQuery.
   - 3.9 Delete temporary file.
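Step 3.4 is the least obvious part of the process, so a minimal sketch may help. It assumes the first three columns are Symbol, ID, and Tab, and that every remaining column is a sample; the function name `denormalize_matrix` and the example platform and pipeline values are hypothetical. The processor itself works on dataframes; this version uses only the standard library so that it runs without extra dependencies.

```python
import csv
import io


def denormalize_matrix(tsv_text, project_id, study_id, platform, pipeline):
    """Turn a feature-by-sample matrix into one row per (feature, sample)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    sample_ids = header[3:]  # assumption: columns 0-2 are Symbol, ID, Tab
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        symbol, feature_id, tab = record[:3]
        for sample_id, value in zip(sample_ids, record[3:]):
            rows.append({
                "SampleBarcode": sample_id,
                "Project": project_id,
                "Study": study_id,
                "Platform": platform,
                "Pipeline": pipeline,
                "Symbol": symbol,
                "ID": feature_id,
                "Tab": tab,
                "Level": float(value),
            })
    return rows


# Hypothetical two-feature, two-sample matrix.
example = (
    "Symbol\tID\tTab\tS1\tS2\n"
    "BRCA\tBRCA ID\tinfo\t1.5\t2.0\n"
    "EGFR\tEGFR ID\tinfo\t0.1\t0.3\n"
)
rows = denormalize_matrix(example, project_id=1, study_id=2,
                          platform="example-platform",
                          pipeline="example-pipeline")
print(len(rows))
print(rows[0]["SampleBarcode"], rows[0]["Symbol"], rows[0]["Level"])
```

Each wide matrix row expands into as many output rows as there are sample columns, which is the shape the molecular data type schema expects.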

### BigQuery Schemas:

Molecular Data Type Schema (mrna, mirna, protein, meth)

| Name | Type | Description |
| --- | --- | --- |
| SampleBarcode | STRING | Sample barcode |
| Project | INTEGER | User's Project ID this value is associated with. This refers to the in-app Project model. |
| Study | INTEGER | User's Study ID this value is associated with. This refers to the in-app Study model. |
| Platform | STRING | Platform used to generate this value. |
| Pipeline | STRING | Pipeline used to generate this value. |
| Symbol | STRING | Can represent the gene symbol or mirna name. This column is mainly used for filtering, depending on the datatype. |
| ID | STRING | Can represent the gene ID, mirna ID, or probe ID. This column is mainly used for filtering, depending on the datatype. |
| Tab | STRING | Can represent extra information such as protein name. This is an additional column that can be used for storing extra information. |
| Level | FLOAT | Actual values associated with the sample and datatype. This represents beta levels, expression levels, or counts. |
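As an illustration, the table above can be written as a list of field definitions in the `name`/`type` form that BigQuery JSON schema files use. The names `MOLECULAR_SCHEMA` and `coerce_row` are hypothetical and do not come from this repository; the helper only shows how string values read from a file might be cast to the declared types before upload.

```python
MOLECULAR_SCHEMA = [
    {"name": "SampleBarcode", "type": "STRING"},
    {"name": "Project", "type": "INTEGER"},
    {"name": "Study", "type": "INTEGER"},
    {"name": "Platform", "type": "STRING"},
    {"name": "Pipeline", "type": "STRING"},
    {"name": "Symbol", "type": "STRING"},
    {"name": "ID", "type": "STRING"},
    {"name": "Tab", "type": "STRING"},
    {"name": "Level", "type": "FLOAT"},
]

# Python callables used to cast raw values to each declared type.
_CASTS = {"STRING": str, "INTEGER": int, "FLOAT": float}


def coerce_row(row, schema=MOLECULAR_SCHEMA):
    """Return a copy of row with every schema column present and typed.

    Missing or empty values become None; columns not in the schema are dropped.
    """
    coerced = {}
    for field in schema:
        value = row.get(field["name"])
        if value is None or value == "":
            coerced[field["name"]] = None
        else:
            coerced[field["name"]] = _CASTS[field["type"]](value)
    return coerced


# Hypothetical raw row, as it might look straight after parsing a text file.
raw = {"SampleBarcode": "S1", "Project": "1", "Study": "2",
       "Symbol": "TP53", "Level": "0.25"}
typed = coerce_row(raw)
print(typed["Project"], typed["Level"], typed["Platform"])
```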

User Generated Data Schema (user_gen)

| Name | Type | Description |
| --- | --- | --- |
| SampleBarcode | STRING | Sample barcode |
| Project | INTEGER | User's Project ID this value is associated with. This refers to the in-app Project model. |
| Study | INTEGER | User's Study ID this value is associated with. This refers to the in-app Study model. |

These are the only required columns in this schema. All other columns are generated from the data provided, so they vary with the files processed.
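The extra columns come from merging every user_gen file on SampleBarcode (step 2.4). The sketch below shows one way such a merge could behave. It assumes an outer join, so a sample missing from one file still appears with empty values, and it lets later files overwrite earlier ones when column names collide; neither behavior is confirmed by this README. The function name `merge_on_sample_barcode` and the sample data are hypothetical, and plain dictionaries stand in for the dataframes the processor uses.

```python
def merge_on_sample_barcode(tables):
    """Outer-join lists of row dicts on their SampleBarcode value."""
    merged = {}
    for table in tables:
        for row in table:
            # A KeyError here means a file lacks the SampleBarcode mapping.
            barcode = row["SampleBarcode"]
            merged.setdefault(barcode, {"SampleBarcode": barcode}).update(row)

    # Collect every column seen, keeping first-seen order.
    columns = ["SampleBarcode"]
    for row in merged.values():
        for name in row:
            if name not in columns:
                columns.append(name)

    # Give every sample the full column set, filling gaps with None.
    return [{name: row.get(name) for name in columns}
            for row in merged.values()]


# Two hypothetical user_gen files, already renamed via their column mappings.
clinical = [{"SampleBarcode": "S1", "age": 54},
            {"SampleBarcode": "S2", "age": 61}]
treatment = [{"SampleBarcode": "S1", "drug": "A"},
             {"SampleBarcode": "S3", "drug": "B"}]

samples = merge_on_sample_barcode([clinical, treatment])
for sample in samples:
    print(sample)
```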

### Environment Variables for .env file

| Name | Description or Value |
| --- | --- |
| db_host | Host of database |
| db | Name of database |
| db_user | User for database connection |
| db_password | Password for user |
| ssl_cert | If SSL is required, path to client-cert.pem |
| ssl_key | If SSL is required, path to client-key.pem |
| ssl_ca | If SSL is required, path to server-ca.pem |
| privatekey_path | Path to the privatekey.json that's generated by gcloud_authenticate.sh |
| tmp_bucket_location | Bucket name to write temporary files that are used to upload to BigQuery |
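A quick check of the .env file before a run can catch missing settings early. The sketch below assumes the file uses plain `KEY=value` lines and treats every variable except the three SSL paths as required; both points are assumptions, as are the helper names `parse_env` and `missing_keys` and the placeholder values.

```python
REQUIRED_KEYS = ["db_host", "db", "db_user", "db_password",
                 "privatekey_path", "tmp_bucket_location"]


def parse_env(text):
    """Parse KEY=value lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(settings):
    """List the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


# Placeholder contents; tmp_bucket_location is left out on purpose.
sample_env = """
# database connection
db_host=127.0.0.1
db=example_db
db_user=example_user
db_password="example-password"
privatekey_path=./privatekey.json
"""

settings = parse_env(sample_env)
print(missing_keys(settings))
```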

### Additional Environment Variables for Jenkins

| Name | Description |
| --- | --- |
| GAE_CLIENT_EMAIL | Client email from privatekey.json |
| GAE_CLIENT_ID | Client ID from privatekey.json |
| GAE_PRIVATE_KEY | Private key from privatekey.json |
| GAE_PRIVATE_KEY_ID | Private key ID from privatekey.json |
| GCLOUD_BUCKET | Place to download .env and ssl files from |
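The four `GAE_*` variables correspond to fields of a service account key file. As a rough illustration only, the sketch below maps them back to the `client_email`, `client_id`, `private_key`, and `private_key_id` fields. How gcloud_authenticate.sh actually builds privatekey.json is not described here, a real key file carries more fields than these four, and the function name `build_private_key_json` and the placeholder values are hypothetical.

```python
import json

# Jenkins variable name -> field name in the key file.
ENV_TO_KEY_FIELD = {
    "GAE_CLIENT_EMAIL": "client_email",
    "GAE_CLIENT_ID": "client_id",
    "GAE_PRIVATE_KEY": "private_key",
    "GAE_PRIVATE_KEY_ID": "private_key_id",
}


def build_private_key_json(environ):
    """Serialize the GAE_* values as JSON, failing loudly if any is unset."""
    missing = [name for name in ENV_TO_KEY_FIELD if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    key = {field: environ[name] for name, field in ENV_TO_KEY_FIELD.items()}
    return json.dumps(key, indent=2, sort_keys=True)


# A plain dict stands in for os.environ so the example has no side effects.
fake_environ = {
    "GAE_CLIENT_EMAIL": "example@example.invalid",
    "GAE_CLIENT_ID": "1234567890",
    "GAE_PRIVATE_KEY": "placeholder-private-key",
    "GAE_PRIVATE_KEY_ID": "placeholder-key-id",
}
key_json = build_private_key_json(fake_environ)
print(key_json)
```

In a Jenkins job, `os.environ` could be passed in place of the dictionary.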
