GitHub - COL-IU/XLSearch: XLSearch: a probabilistic database search algorithm for identifying cross-linked peptides

XLSearch, Version 1.0
Copyright of School of Informatics and Computing, Indiana University
Contact: jic@indiana.edu, sujli@indiana.edu

I. INTRODUCTION
This software is intended to perform database sequence search for identifying
chemically cross-linked peptide pairs from tandem mass spectra. Usage of this
software is free of charge for academic purposes.

II. PREREQUISITES

i. Software packages

This software can be run on Unix/Linux operating systems.
1. Python version 2.6 or higher is required.
2. To perform the in-sample training (i.e. 'training mode'), additional
Python modules (NumPy 1.6.1 or higher, SciPy 0.9 or higher, scikit-learn
0.15 or higher) are required.

ii. Data
1. mzXML files containing tandem mass spectra converted using msconvert
(http://proteowizard.sourceforge.net/tools.shtml) from RAW files.
NOTE: Currently only mzXML format is supported.
2. Fasta file containing the desired protein sequences to be searched
against.

III. USAGE

XLSearch can be run in 'searching mode' or 'training mode'. Searching
mode performs the database sequence search, in which each peptide-spectrum
match (PSM) is assigned a score based on computed features that describe
the matching quality between the spectrum and each individual peptide,
together with the weights of pre-trained logistic models. Training mode
re-trains the logistic models using authentic cross-link PSMs obtained from
new data.
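
As a rough illustration of how a logistic model turns a feature vector and a
set of weights into a probability, the sketch below evaluates a single logistic
function. The function name, the toy values, and the convention that weight 0
is an intercept are assumptions made for illustration; this is not XLSearch's
own code, and how the two classifiers are combined into the joint score is not
shown here.

```python
import math

def logistic_probability(weights, features):
    """Return sigmoid(w0 + sum(w_i * x_i)).

    Assumption: weights[0] is an intercept and weights[1:] pair up
    with the feature values in order.
    """
    if len(weights) != len(features) + 1:
        raise ValueError("expected exactly one more weight than features")
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy values, not taken from PARAM.txt.
print(logistic_probability([0.0, 1.0, -1.0], [2.0, 2.0]))  # 0.5
```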

i. Searching mode
Input:
1) PARAM.txt: Contains parameters for performing the database search.
2) database.fasta: Text file containing amino acid sequences in FASTA
format. Its path is specified in 'PARAM.txt'.

Steps:
1. Preparation:
a. Unzip the .zip file to a directory (e.g. '/xlsearch_install_dir/'). It
should contain the Python modules in '/xlsearch_install_dir/lib/', as well
as the pipeline scripts for searching and for training the model
('xlsearch_search.py' and 'xlsearch_train.py').

b. Create a directory where the search is to be performed (e.g. '/xlsearch_search_dir/').
c. Copy the files 'xlsearch_search.py' and 'PARAM.txt' to this directory.
d. Copy the FASTA sequence file (e.g. 'database.fasta') to this directory.
e. Create a directory to hold the mzXML files (e.g. '/xlsearch_search_dir/mzxml/') and place the mzXML files in it.
f. Edit the parameter file 'PARAM.txt' as needed.

2. Perform database search
Under directory '/xlsearch_search_dir/':

$ python xlsearch_search.py -l /xlsearch_install_dir/ \
-p PARAM.txt \
-o output.txt

where '-l', '-p' and '-o' indicate the path to the library, the parameter file
and the output file name, respectively. All three arguments are required.
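
For scripted runs, the same command can be assembled from Python. The sketch
below only uses the three flags described above; the helper names
'build_search_command' and 'run_search' are hypothetical and not part of
XLSearch, and the paths are the example paths from this README.

```python
import subprocess

def build_search_command(install_dir, param_file, output_file):
    """Assemble the argument list for xlsearch_search.py (all three flags are required)."""
    return [
        "python", "xlsearch_search.py",
        "-l", install_dir,   # path to the library
        "-p", param_file,    # parameter file
        "-o", output_file,   # output file name
    ]

def run_search(cmd, search_dir):
    """Run the command from the search directory; raises if the script exits with an error."""
    subprocess.check_call(cmd, cwd=search_dir)

cmd = build_search_command("/xlsearch_install_dir/", "PARAM.txt", "output.txt")
print(" ".join(cmd))
# To launch the search (requires an unpacked XLSearch and a prepared search directory):
# run_search(cmd, "/xlsearch_search_dir/")
```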

3. Output file
A tab-delimited text file containing the top-scoring PSM for each query spectrum,
sorted by the joint probability score assigned to each PSM.
The first line contains the column headers:
a. Rank of PSM
b. Sequence of alpha peptide
c. Sequence of beta peptide
d. Index of cross-linking site on alpha
e. Index of cross-linking site on beta
f. Protein header of alpha peptide
g. Protein header of beta peptide
h. Charge state
i. Joint probability score P(alpha = T, beta = T)
j. Marginal probability P(alpha = T)
k. Marginal probability P(beta = T)
l. The title of the query spectrum
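
A minimal sketch of reading this file with Python 3's standard 'csv' module is
given below. The field names are hypothetical labels chosen here for the twelve
columns a-l (the actual header strings in the file may differ), the sample row
is made up, and the sketch assumes that rank, site indices and charge are
integers and that the three probabilities are decimal numbers.

```python
import csv
import io

# Hypothetical field names; positions follow columns a-l listed above.
FIELDS = ["rank", "alpha_seq", "beta_seq", "alpha_site", "beta_site",
          "alpha_protein", "beta_protein", "charge",
          "joint_prob", "alpha_prob", "beta_prob", "spectrum_title"]

def read_psms(handle):
    """Parse a tab-delimited PSM table, skipping the header line."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # first line holds the column headers
    psms = []
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # ignore blank or malformed lines
        psm = dict(zip(FIELDS, row))
        for key in ("rank", "alpha_site", "beta_site", "charge"):
            psm[key] = int(psm[key])
        for key in ("joint_prob", "alpha_prob", "beta_prob"):
            psm[key] = float(psm[key])
        psms.append(psm)
    return psms

# Made-up example content; with a real result use open("output.txt", newline="").
sample = ("\t".join(FIELDS) + "\n"
          "1\tAKDEFR\tGHKLMR\t1\t2\tprotA\tprotB\t3\t0.98\t0.99\t0.985\tscan_0001\n")
psms = read_psms(io.StringIO(sample))
print(psms[0]["alpha_seq"], psms[0]["joint_prob"])
```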

4. Evaluating identified PSMs
The output file contains the top-scoring PSM for each query spectrum, sorted in descending
order of the joint probability score. The false discovery rate (FDR) at a given score
cutoff $S$ is estimated by counting the numbers of true-true (TT), true-false (TF), and
false-false (FF) PSMs whose scores are greater than $S$. Specifically,

FDR = (#(TF) - #(FF)) / #(TT)
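
The formula above can be sketched in Python as follows. The function name and
the input layout (a list of (score, label) pairs with labels 'TT', 'TF' and
'FF') are assumptions for illustration; how XLSearch assigns these labels is
not described in this section, and returning 0.0 when no true-true PSM passes
the cutoff is a choice made here, not documented behaviour.

```python
def fdr_at_cutoff(scored_psms, cutoff_score):
    """Estimate FDR = (#TF - #FF) / #TT over PSMs scoring above cutoff_score."""
    counts = {"TT": 0, "TF": 0, "FF": 0}
    for score, label in scored_psms:
        if score > cutoff_score:
            counts[label] += 1
    if counts["TT"] == 0:
        return 0.0  # assumption: avoid division by zero when nothing passes
    return (counts["TF"] - counts["FF"]) / float(counts["TT"])

# Made-up scores and labels.
example = [(0.99, "TT"), (0.95, "TT"), (0.90, "TF"),
           (0.85, "TT"), (0.80, "TT"), (0.60, "FF")]
print(fdr_at_cutoff(example, 0.7))  # (1 - 0) / 4 = 0.25
```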

To filter the output PSMs at a given FDR, provide values for 'cutoff' and
'is_unique' in the parameter file, where 'cutoff' is the desired FDR cutoff
and 'is_unique' ('True' or 'False') indicates whether only unique cross-linked peptides
(i.e. unique combinations of cross-linked peptides and charge) or all redundant PSMs are counted
in the FDR calculation. For example, to filter the results at a 1% FDR cutoff with
redundant PSMs counted, set 'cutoff' to 0.01 and 'is_unique' to False.

The filtered results will be written to the files 'intra0.01.txt' and 'inter0.01.txt' for
intra-protein and inter-protein cross-links, respectively.
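
To make the 'is_unique' option and the intra/inter split concrete, here is a
small sketch. It assumes a unique cross-link is keyed by (alpha sequence, beta
sequence, charge) in the order given, and that a cross-link is intra-protein
when both peptides carry the same protein header; both are assumptions for
illustration, and the helper names and data are hypothetical rather than
XLSearch's own.

```python
def collapse_redundant(psms):
    """Keep the first PSM seen for each (alpha, beta, charge) combination."""
    seen = set()
    unique = []
    for psm in psms:
        key = (psm["alpha"], psm["beta"], psm["charge"])
        if key not in seen:
            seen.add(key)
            unique.append(psm)
    return unique

def split_intra_inter(psms):
    """Separate PSMs by whether both peptides share the same protein header."""
    intra = [p for p in psms if p["alpha_protein"] == p["beta_protein"]]
    inter = [p for p in psms if p["alpha_protein"] != p["beta_protein"]]
    return intra, inter

# Made-up PSMs; the second one repeats the first cross-link at the same charge.
psms = [
    {"alpha": "AKDEFR", "beta": "GHKLMR", "charge": 3,
     "alpha_protein": "protA", "beta_protein": "protA"},
    {"alpha": "AKDEFR", "beta": "GHKLMR", "charge": 3,
     "alpha_protein": "protA", "beta_protein": "protA"},
    {"alpha": "AKDEFR", "beta": "STKVWR", "charge": 4,
     "alpha_protein": "protA", "beta_protein": "protB"},
]
unique = collapse_redundant(psms)
intra, inter = split_intra_inter(unique)
print(len(psms), len(unique), len(intra), len(inter))  # 3 2 1 1
```

If the input list is already sorted by descending score, keeping the first
occurrence retains the best-scoring PSM for each unique cross-link.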

ii. Training mode
Input:
1) PARAM.txt: Contains parameters for performing the database search.
2) target_database.fasta: Contains only the TARGET sequences from which true-true
PSMs can be identified.
3) uniprot_database.fasta: Contains the pool of protein sequences from which the
true-false and false-false PSMs can be generated based on the true-true PSMs.
4) true_true.psm (optional): Contains the authentic true-true PSMs from which
the true-false and false-false PSMs can be generated. Check the sample file for
the format.

Steps:
1. Preparation:
a. Same as in searching mode.
b. Create a directory where the training is to be performed (e.g. '/xlsearch_train_dir/').
c. Copy 'xlsearch_train.py' to this directory.
d. Copy the FASTA sequence files ('target_database.fasta', 'uniprot_database.fasta')
to this directory.
e. Create a directory to hold the mzXML files (e.g. '/xlsearch_train_dir/mzxml/') and place the mzXML files in it.
f. Edit the parameter file 'PARAM.txt' as needed.

2. Perform training

Under directory '/xlsearch_train_dir/':

$ python xlsearch_train.py -l /xlsearch_install_dir/ \
-p PARAM.txt \
-o output.txt

where '-l', '-p' and '-o' indicate the path to the library, the parameter file
and the output file name, respectively. All three arguments are required.

3. Output file
The output will be in the following format:

CI00 ... weight 0 of classifier I
...
CI15 ... weight 15 of classifier I

CII00 ... weight 0 of classifier II
...
CII15 ... weight 15 of classifier II

nTT ... number of true-true PSMs
nTF ... number of true-false PSMs
nFF ... number of false-false PSMs

These lines correspond to the logistic regression parameters for classifiers I and II
('CI' and 'CII'), and the numbers of true-true, true-false and false-false PSMs used
to train them ('nTT', 'nTF', 'nFF'). To use the updated model, overwrite the
corresponding parameters in 'PARAM.txt' with these lines.