MSI Preprocessing Pipeline

Default preprocessing pipeline for MSI data in raw ASCII format, as used by the Data Mining Group at the Silesian University of Technology.

Process

The packaged pipeline consists of the following steps:

  1. Find common m/z range
  2. Find resampled m/z axis that will be common for all datasets
  3. Resample all datasets to common m/z axis
  4. Remove the baseline using the adaptive window method (as proposed by Katarzyna Frątczak)
  5. Detect outliers in the data with respect to their TIC value
  6. Align spectra to the average spectrum with the PAFFT method
  7. Normalize spectra to a common TIC (a minimal sketch follows this list)
  8. Build a Gaussian Mixture Model (GMM) of the average spectrum
  9. Remove outlier components of the GMM
  10. Compute convolutions of the spectra with the GMM components
  11. Merge multiple GMM components that resemble a single peak
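
For intuition, here is a minimal sketch of step 7 (TIC normalization), assuming the spectra are held as a 2-D NumPy array with one spectrum per row; this is an illustration, not the pipeline's actual code:

import numpy as np

def normalize_to_common_tic(spectra):
    """Scale each spectrum so that all share the mean total ion current."""
    tic = spectra.sum(axis=1, keepdims=True)  # per-spectrum TIC
    return spectra * (tic.mean() / tic)       # rescale to the common TIC

spectra = np.array([[1., 2., 3., 4.],
                    [2., 4., 6., 8.],
                    [0., 1., 0., 1.]])
print(normalize_to_common_tic(spectra).sum(axis=1))  # all TICs now equal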

Installation

The preferred installation is via Docker.

Having Docker installed, you can just pull the image:

docker pull gmrukwa/msi-preprocessing

Running

Data Format

You need to prepare your data for processing:

  1. Create an empty directory /mydata
  2. Create a directory /mydata/raw - this is where the pipeline expects your original data
  3. Each dataset should be contained in its own subdirectory:
/mydata
    |- raw
        |- my-dataset1
        |- my-dataset2
        |- ...
  4. Each subdirectory should contain ASCII files organized as provided by Bruker, e.g.:
/mydata
    |- raw
        |- my-dataset1
            |- my-dataset1_0_R00X309Y111_1.txt
            |- my-dataset1_0_R00X309Y112_1.txt
            |- my-dataset1_0_R00X309Y113_1.txt
            |- my-dataset1_0_R00X309Y114_1.txt
            |- my-dataset1_0_R00X309Y115_1.txt
            |- ...

Note: File names are important, since the R, X and Y values are parsed as metadata! If these values are broken, the spatial relations between spectra will be lost.
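
To illustrate how such names decompose, a hypothetical parser for the R/X/Y coordinates might look as follows (the regular expression is inferred from the example names above and is an assumption, not the pipeline's actual parsing rule):

import re

# Pattern inferred from names like my-dataset1_0_R00X309Y111_1.txt;
# an assumption, not the pipeline's actual parsing rule.
COORDINATES = re.compile(r'R(\d+)X(\d+)Y(\d+)')

def parse_coordinates(filename):
    """Extract the (R, X, Y) spatial metadata from a file name."""
    match = COORDINATES.search(filename)
    if match is None:
        raise ValueError('no R/X/Y coordinates in ' + filename)
    return tuple(int(group) for group in match.groups())

print(parse_coordinates('my-dataset1_0_R00X309Y111_1.txt'))  # (0, 309, 111)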

  5. Each ASCII file should be in the format provided by Bruker, e.g.:
700,043096125457 2
700,051503297599 2
700,059910520559 1
700,068317794335 0
...
<another-mz-value> <another-ions-count>
...
3496,66186447226 1
3496,68071341296 3
3496,69956240485 2

Both . and , are supported as the decimal separator.
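
For illustration, a minimal reader for one such file, accepting either decimal separator, could look like this (a sketch only, not the pipeline's actual loader):

import numpy as np

def read_bruker_ascii(path):
    """Read one '<m/z> <ions-count>' pair per line into two arrays."""
    mz, counts = [], []
    with open(path) as handle:
        for line in handle:
            mz_text, count_text = line.split()
            # Accept ',' as well as '.' as the decimal separator.
            mz.append(float(mz_text.replace(',', '.')))
            counts.append(int(count_text))
    return np.array(mz), np.array(counts)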

An example of the expected structure can be found in sample-data.

Launch

You can launch preprocessing via:

docker run -v /mydata:/data -p 8082:8082 gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]'

Results will appear in the /mydata directory as soon as they are available. You can track the progress live at localhost:8082.

If you also need the output data as .csv files (rather than only the binary NumPy .npy files), you can simply add the --export-csv switch:

docker run --rm -ti -v /mydata:/data -p 8082:8082 gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]' --export-csv

Note: There must be no space between the dataset names.

Note: The --export-csv switch must appear right after the datasets (due to the way Docker handles arguments).
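
If you skip --export-csv, the outputs remain plain NumPy binaries, which you can load directly; the path below is a placeholder, as the exact output layout depends on your datasets:

import numpy as np

# Placeholder path -- substitute an actual .npy file produced under /mydata.
spectra = np.load('/mydata/path/to/output.npy')
print(spectra.shape, spectra.dtype)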

If you want to review the time each task needed, you can prevent the scheduler from being stopped with the --keep-alive switch:

docker run --rm -ti -v /mydata:/data -p 8082:8082 gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]' --keep-alive

Note: The --keep-alive switch must always come last.

Launch Sample

  1. Download the sample-data directory
  2. Run docker run -v sample-data:/data -p 8082:8082 gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]'
  3. Track progress at localhost:8082

Building the GMM model takes a long time (at least 1 hour), so be patient.

Advanced

CPUs Limit

It may happen that the data is too big to be copied across all your CPU cores. In that case it may be useful to limit the number of cores used. You can do this via the additional --pool-size switch. By default, all cores are used (or a single one, when detection is impossible).

Example:

docker run --rm -ti -v /mydata:/data -p 8082:8082 gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]' --pool-size 2 --keep-alive
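
The default described above (all cores, or a single one when detection is impossible) matches the usual Python idiom for core detection, presumably something like:

import os

# os.cpu_count() returns None when detection fails, hence the fallback.
pool_size = os.cpu_count() or 1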

E-Mail Notifications

You can add e-mail notifications to your configuration. They will notify you about failures and about successful completion of the pipeline. Two methods are supported: via SendGrid and via an SMTP server.

SendGrid

  1. Create an API key in your SendGrid account
  2. Download template.env as .env
  3. In the .env file, set the following values (preserve the rest of the content intact):
LUIGI_EMAIL_METHOD=sendgrid
LUIGI_EMAIL_RECIPIENT=<your-email-here>
LUIGI_SENDGRID_APIKEY=<your-api-key-here>
  4. When launching processing with Docker, use the additional switch --env-file .env:
docker run --rm -ti -v /mydata:/data --env-file .env gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]'

SMTP Server

  1. For your e-mail provider, obtain the mail client configuration (SMTP host, port, and security settings)
  2. Download template.env as .env
  3. In the .env file, set the following values (preserve the rest of the content intact):
LUIGI_EMAIL_METHOD=smtp
LUIGI_EMAIL_RECIPIENT=<your-email-here>
LUIGI_EMAIL_SENDER=<your-email-here>

LUIGI_SMTP_HOST=<smtp-host-of-your-provider>
LUIGI_SMTP_PORT=<smtp-port-of-your-provider>
LUIGI_SMTP_NO_TLS=<False-if-your-provider-uses-TLS-True-otherwise>
LUIGI_SMTP_SSL=<True-if-your-provider-uses-SSL-False-otherwise>
LUIGI_SMTP_PASSWORD=<password-to-your-email-account>
LUIGI_SMTP_USERNAME=<login-to-your-email-account>
  4. When launching processing with Docker, use the additional switch --env-file .env:
docker run --rm -ti -v /mydata:/data --env-file .env gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]'

History Persistence

Task history is collected in an SQLite database. If you want to persist the database, mount the /luigi directory. This can be done via:

docker run --rm -ti -v tasks-history:/luigi -v /mydata:/data gmrukwa/msi-preprocessing '["my-dataset1","my-dataset2"]'
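
With the database persisted, you can inspect task timings yourself. The sketch below assumes the file name luigi-task-hist.db and Luigi's standard DbTaskHistory schema (tables tasks and task_events); both are assumptions, so check the mounted volume for the actual database file.

import sqlite3

# File name and schema are assumptions based on Luigi's DbTaskHistory;
# look inside the tasks-history volume for the actual database file.
connection = sqlite3.connect('luigi-task-hist.db')
query = ('SELECT tasks.name, task_events.event_name, task_events.ts '
         'FROM tasks JOIN task_events ON task_events.task_id = tasks.id '
         'ORDER BY task_events.ts')
for name, event, timestamp in connection.execute(query):
    print(timestamp, name, event)
connection.close()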