Code example #1
File: methods.py Project: cxhernandez/pycogent
def PB(alignment, order=DNA_ORDER):
    """Returns sequence weights based on the diversity at each position.

    The position-based (PB) sequence weighting method is described in Henikoff
    1994. The idea is that sequences are weighted by the diversity observed
    at each position in the alignment rather than on the diversity measured
    for whole sequences.

    A simple method to represent the diversity at a position is to award 
    each different residue an equal share of the weight, and then to divide 
    up that weight equally among the sequences sharing the same residue. 
    So if at a position of an MSA, r different residues are represented,
    a residue represented in only one sequence contributes a score of 1/r to
    that sequence, whereas a residue represented in s sequences contributes
    a score of 1/(rs) to each of the s sequences. For each sequence, the
    contributions from each position are summed to give a sequence's weight.

    See Henikoff 1994 for a good example.
    """
    #calculate the contribution of each character at each position
    pos_weights = pos_char_weights(alignment, order)
    d = pos_weights.Data
    
    result = Weights()

    for key, seq in alignment.items():
        weight = 0
        for idx, char in enumerate(seq):
            weight += d[order.index(char), idx]
        result[key] = weight
    
    result.normalize()
    return result
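
The 1/r and 1/(rs) rule above is easy to verify by hand. Below is a minimal self-contained sketch in plain Python, independent of pycogent's pos_char_weights and Weights; the name position_based_weights is made up for illustration:

def position_based_weights(seqs):
    """Position-based weights (Henikoff 1994) for a dict of name -> sequence."""
    weights = dict.fromkeys(seqs, 0.0)
    length = len(next(iter(seqs.values())))
    for idx in range(length):
        column = {name: seq[idx] for name, seq in seqs.items()}
        r = len(set(column.values()))  # distinct residues at this position
        for name, char in column.items():
            s = list(column.values()).count(char)  # sequences sharing this residue
            weights[name] += 1.0 / (r * s)  # the 1/(rs) contribution
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Column 1 of the toy alignment below has r=2: 'c' holds A alone (1/2),
# while 'a' and 'b' share G (1/4 each). Column 2 has r=1: 1/3 for everyone.
print(position_based_weights({'a': 'GC', 'b': 'GC', 'c': 'AC'}))
# {'a': 0.2917, 'b': 0.2917, 'c': 0.4167} (approximately)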
Code example #2
File: methods.py Project: mikerobeson/pycogent
def PB(alignment, order=DNA_ORDER):
    """Returns sequence weights based on the diversity at each position.

    The position-based (PB) sequence weighting method is described in Henikoff
    1994. The idea is that sequences are weighted by the diversity observed
    at each position in the alignment rather than on the diversity measured
    for whole sequences.

    A simple method to represent the diversity at a position is to award 
    each different residue an equal share of the weight, and then to divide 
    up that weight equally among the sequences sharing the same residue. 
    So if at a position of an MSA, r different residues are represented,
    a residue represented in only one sequence contributes a score of 1/r to
    that sequence, whereas a residue represented in s sequences contributes
    a score of 1/(rs) to each of the s sequences. For each sequence, the
    contributions from each position are summed to give a sequence's weight.

    See Henikoff 1994 for a good example.
    """
    #calculate the contribution of each character at each position
    pos_weights = pos_char_weights(alignment, order)
    d = pos_weights.Data

    result = Weights()

    for key, seq in alignment.items():
        weight = 0
        for idx, char in enumerate(seq):
            weight += d[order.index(char), idx]
        result[key] = weight

    result.normalize()
    return result
Code example #3
 def test_weights(self):
     """Weights: should behave like a normal dict and can be normalized
     """
     w = Weights({'seq1': 2, 'seq2': 3, 'seq3': 10})
     self.assertEqual(w['seq1'], 2)
     w.normalize()
     exp = {'seq1': 0.1333333, 'seq2': 0.2, 'seq3': 0.6666666}
     self.assertFloatEqual(w.values(), exp.values())
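
The test only relies on dict behavior plus normalize(). For readers without pycogent, a minimal stand-in with those two properties might look like this (a sketch, not pycogent's actual Weights implementation):

class Weights(dict):
    """Dict of name -> weight; normalize() scales values to sum to 1."""

    def normalize(self):
        total = sum(self.values())
        for key in self:
            self[key] = self[key] / total

w = Weights({'seq1': 2, 'seq2': 3, 'seq3': 10})
w.normalize()
print(w)  # {'seq1': 0.1333..., 'seq2': 0.2, 'seq3': 0.6666...}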
Code example #4
File: test_util.py Project: miklou/pycogent
 def test_weights(self):
     """Weights: should behave like a normal dict and can be normalized
     """
     w = Weights({'seq1': 2, 'seq2': 3, 'seq3': 10})
     self.assertEqual(w['seq1'], 2)
     w.normalize()
     exp = {'seq1': 0.1333333, 'seq2': 0.2, 'seq3': 0.6666666}
     self.assertFloatEqual(w.values(), exp.values())
Code example #5
File: methods.py Project: cxhernandez/pycogent
def mVOR(alignment, n=1000, order=DNA_ORDER):
    """Returns sequence weights according to the modified Voronoi method.
    
    alignment: Alignment object
    n: sample size (=number of random profiles to be generated)
    order: specifies the order of the characters found in the alignment,
        used to build the sequence and random profiles.
    
    mVOR is a modification of the VOR method. Instead of generating discrete
    random sequences, it generates random profiles, to sample more evenly from
    the sequence space and to prevent random sequences from being equidistant to
    multiple sequences in the alignment.

    See the Implementation notes to see how the random profiles are generated
    and compared to the 'sequence profiles' from the alignment.

    Random generalized sequences (or a profile filled with random numbers):
    Sequences that are equidistant to multiple sequences in the alignment
    can pose a problem in small datasets. For longer sequences the likelihood
    of this event is negligible. Generating 'random generalized sequences' is
    a solution, because we're then sampling from continuous sequence space.
    Each column of a random profile is generated by normalizing a set of
    independent, exponentially distributed random numbers. In other words, a
    random profile is a two-dimensional array (rows are chars in the alphabet,
    columns are positions in the alignment) filled with random numbers
    sampled from the standard exponential distribution (lambda=1, and thus
    mean=1), where each column is normalized to one. These random profiles
    are compared to the special profiles of just one sequence (ones for the 
    single character observed at that position). The distance between the 
    two profiles is simply the Euclidean distance.

    """
    
    weights = zeros(len(alignment.Names), Float64)

    #get seq profiles
    seq_profiles = {}
    for k, v in alignment.items():
        #seq_profiles[k] = ProfileFromSeq(v, order=order)
        seq_profiles[k] = SeqToProfile(v, alphabet=order)

    for count in range(n):
        #generate a random profile
        exp = exponential(1, [alignment.SeqLen, len(order)])
        r = Profile(Data=exp, Alphabet=order)
        r.normalizePositions()
        #append the distance between the random profile and the sequence
        #profile to temp
        temp = [seq_profiles[key].distance(r) for key in alignment.Names]
        votes = row_to_vote(array(temp))
        weights += votes
    weight_dict = Weights(dict(zip(alignment.Names, weights)))
    weight_dict.normalize()
    return weight_dict
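
The random profiles are easy to picture with numpy: each position gets a row of standard-exponential draws normalized to sum to one, and each sequence gets a one-hot profile. Here is a sketch of a single mVOR comparison on a toy alignment; the shapes mirror the exponential(1, [alignment.SeqLen, len(order)]) call above, and 'TCAG' standing in for DNA_ORDER is an assumption:

import numpy as np

rng = np.random.default_rng(0)
order = 'TCAG'                    # stand-in for DNA_ORDER (assumption)
seqs = {'s1': 'TT', 's2': 'GG'}   # toy two-sequence alignment
seq_len, k = 2, len(order)

# One-hot 'sequence profile': a 1 for the observed char at each position.
profiles = {name: np.eye(k)[[order.index(c) for c in seq]]
            for name, seq in seqs.items()}

# One random profile: standard-exponential draws, each position (row)
# normalized so the character distribution at that position sums to 1.
rand = rng.exponential(1.0, size=(seq_len, k))
rand /= rand.sum(axis=1, keepdims=True)

# Euclidean distance from the random profile to each sequence profile;
# the closest sequence(s) receive this iteration's vote.
dists = {name: np.linalg.norm(p - rand) for name, p in profiles.items()}
print(min(dists, key=dists.get), dists)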
Code example #6
File: methods.py Project: mikerobeson/pycogent
def mVOR(alignment, n=1000, order=DNA_ORDER):
    """Returns sequence weights according to the modified Voronoi method.
    
    alignment: Alignment object
    n: sample size (=number of random profiles to be generated)
    order: specifies the order of the characters found in the alignment,
        used to build the sequence and random profiles.
    
    mVOR is a modification of the VOR method. Instead of generating discrete
    random sequences, it generates random profiles, to sample more evenly from
    the sequence space and to prevent random sequences from being equidistant to
    multiple sequences in the alignment.

    See the Implementation notes to see how the random profiles are generated
    and compared to the 'sequence profiles' from the alignment.

    Random generalized sequences (or a profile filled with random numbers):
    Sequences that are equidistant to multiple sequences in the alignment
    can pose a problem in small datasets. For longer sequences the likelihood
    of this event is negligible. Generating 'random generalized sequences' is
    a solution, because we're then sampling from continuous sequence space.
    Each column of a random profile is generated by normalizing a set of
    independent, exponentially distributed random numbers. In other words, a
    random profile is a two-dimensional array (rows are chars in the alphabet,
    columns are positions in the alignment) filled with random numbers
    sampled from the standard exponential distribution (lambda=1, and thus
    mean=1), where each column is normalized to one. These random profiles
    are compared to the special profiles of just one sequence (ones for the 
    single character observed at that position). The distance between the 
    two profiles is simply the Euclidean distance.

    """

    weights = zeros(len(alignment.Names), Float64)

    #get seq profiles
    seq_profiles = {}
    for k, v in alignment.items():
        #seq_profiles[k] = ProfileFromSeq(v,order=order)
        seq_profiles[k] = SeqToProfile(v, alphabet=order)

    for count in range(n):
        #generate a random profile
        exp = exponential(1, [alignment.SeqLen, len(order)])
        r = Profile(Data=exp, Alphabet=order)
        r.normalizePositions()
        #append the distance between the random profile and the sequence
        #profile to temp
        temp = [seq_profiles[key].distance(r) for key in alignment.Names]
        votes = row_to_vote(array(temp))
        weights += votes
    weight_dict = Weights(dict(zip(alignment.Names, weights)))
    weight_dict.normalize()
    return weight_dict
Code example #7
File: methods.py Project: cxhernandez/pycogent
def VOR(alignment, n=1000, force_monte_carlo=False, mc_threshold=1000):
    """Returns sequence weights according to the Voronoi weighting method.

    alignment: Alignment object
    n: sampling size (in case monte carlo is used)
    force_monte_carlo: always generate pseudo seqs with monte carlo (even
        if there's only a small number of possible unique pseudo seqs)
    mc_threshold: threshold for switching to the monte carlo sampling method;
        if the number of possible pseudo seqs exceeds this threshold, monte
        carlo is used.

    VOR differs from VA in the set of sequences against which it's comparing
    all the sequences in the alignment. In addition to the sequences in the 
    alignment itself, it uses a set of pseudo sequences.
    
    Generating discrete random sequences:
    A discrete random sequence is generated by choosing with equal
    likelihood at each position one of the residues observed at that position
    in the alignment. A single occurrence in the alignment column is
    sufficient to make the residue type an option. Note: you're choosing
    with equal likelihood from each of the observed residues (independent
    of their frequency at that position). In earlier versions of the algorithm
    the characters were chosen either at the frequency with which they occur
    at a position or at the frequency with which they occur in the database.
    Both trials were unsuccessful, because they deviate from random sampling
    (see Sibbald & Argos 1990).

    Depending on the number of possible pseudo sequences, all of them are 
    used or a random sample is taken (monte carlo).

    Example:
    Alignment: AA, AA, BB
        AA      AA      BB
    AA  0 (.5)  0 (.5)  2
    AB  1 (1/3) 1 (1/3) 1 (1/3)
    BA  1 (1/3) 1 (1/3) 1 (1/3)
    BB  2       2       0 (1)
    -----------------------------
    total 7/6     7/6     10/6
    norm  .291    .291    .418

    For a bigger example with more pseudo sequences, see Henikoff 1994.

    I tried the described optimization (pre-calculating the distance to the
    closest sequence). It doesn't have an advantage over the original method.
    """
    
    MC_THRESHOLD = mc_threshold
    
    #decide on sampling method
    if force_monte_carlo or number_of_pseudo_seqs(alignment) > MC_THRESHOLD:
        sampling_method = pseudo_seqs_monte_carlo
    else:
        sampling_method = pseudo_seqs_exact
    #change sequences into arrays
    aln_array = DenseAlignment(alignment, MolType=BYTES)
    weights = zeros(len(aln_array.Names), Float64)
    #calc distances for each pseudo seq
    rows = [array(seq, 'c') for seq in map(str, aln_array.Seqs)]
    for seq in sampling_method(aln_array, n=n):
        seq = array(seq, 'c')
        temp = [hamming_distance(row, seq) for row in rows]
        votes = row_to_vote(array(temp)) #change distances to votes
        weights += votes #add to previous weights
    weight_dict = Weights(dict(zip(aln_array.Names, weights)))
    weight_dict.normalize() #normalize

    return weight_dict
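
The worked example in the docstring is small enough to check directly. Here is a sketch of the exact (non-monte-carlo) path in plain numpy, with hand-rolled Hamming distances and vote splitting standing in for pycogent's helpers:

import numpy as np
from itertools import product

seqs = ['AA', 'AA', 'BB']  # the docstring's toy alignment

# Exact pseudo-sequence set: every combination of residues observed per column.
columns = [sorted(set(col)) for col in zip(*seqs)]
pseudo = [''.join(p) for p in product(*columns)]  # ['AA', 'AB', 'BA', 'BB']

weights = np.zeros(len(seqs))
for ps in pseudo:
    dists = np.array([sum(a != b for a, b in zip(seq, ps)) for seq in seqs])
    nearest = dists == dists.min()      # the closest sequence(s)...
    weights += nearest / nearest.sum()  # ...split one vote equally

weights /= weights.sum()
print(weights)  # [0.2917 0.2917 0.4167], i.e. 7/24, 7/24, 10/24 as in the table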
Code example #8
File: methods.py Project: mikerobeson/pycogent
def VOR(alignment, n=1000, force_monte_carlo=False, mc_threshold=1000):
    """Returns sequence weights according to the Voronoi weighting method.

    alignment: Alignment object
    n: sampling size (in case monte carlo is used)
    force_monte_carlo: always generate pseudo seqs with monte carlo (even
        if there's only a small number of possible unique pseudo seqs)
    mc_threshold: threshold for switching to the monte carlo sampling method;
        if the number of possible pseudo seqs exceeds this threshold, monte
        carlo is used.

    VOR differs from VA in the set of sequences against which it's comparing
    all the sequences in the alignment. In addition to the sequences in the 
    alignment itself, it uses a set of pseudo sequences.
    
    Generating discrete random sequences:
    A discrete random sequence is generated by choosing with equal
    likelihood at each position one of the residues observed at that position
    in the alignment. A single occurrence in the alignment column is
    sufficient to make the residue type an option. Note: you're choosing
    with equal likelihood from each of the observed residues (independent
    of their frequency at that position). In earlier versions of the algorithm
    the characters were chosen either at the frequency with which they occur
    at a position or at the frequency with which they occur in the database.
    Both trials were unsuccessful, because they deviate from random sampling
    (see Sibbald & Argos 1990).

    Depending on the number of possible pseudo sequences, all of them are 
    used or a random sample is taken (monte carlo).

    Example:
    Alignment: AA, AA, BB
        AA      AA      BB
    AA  0 (.5)  0 (.5)  2
    AB  1 (1/3) 1 (1/3) 1 (1/3)
    BA  1 (1/3) 1 (1/3) 1 (1/3)
    BB  2       2       0 (1)
    -----------------------------
    total 7/6     7/6     10/6
    norm  .291    .291    .418

    For a bigger example with more pseudo sequences, see Henikoff 1994.

    I tried the described optimization (pre-calculating the distance to the
    closest sequence). It doesn't have an advantage over the original method.
    """

    MC_THRESHOLD = mc_threshold

    #decide on sampling method
    if force_monte_carlo or number_of_pseudo_seqs(alignment) > MC_THRESHOLD:
        sampling_method = pseudo_seqs_monte_carlo
    else:
        sampling_method = pseudo_seqs_exact
    #change sequences into arrays
    aln_array = DenseAlignment(alignment, MolType=BYTES)
    weights = zeros(len(aln_array.Names), Float64)
    #calc distances for each pseudo seq
    rows = [array(seq, 'c') for seq in map(str, aln_array.Seqs)]
    for seq in sampling_method(aln_array, n=n):
        seq = array(seq, 'c')
        temp = [hamming_distance(row, seq) for row in rows]
        votes = row_to_vote(array(temp))  #change distances to votes
        weights += votes  #add to previous weights
    weight_dict = Weights(dict(zip(aln_array.Names, weights)))
    weight_dict.normalize()  #normalize

    return weight_dict
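
The exact-versus-monte-carlo decision depends only on how many unique pseudo sequences exist, which is the product over columns of the number of distinct residues in each column. A sketch of that count (number_of_pseudo_seqs is pycogent's helper; this reimplementation is for illustration):

from math import prod

def count_pseudo_seqs(seqs):
    # Product over alignment columns of the distinct residues in each column.
    return prod(len(set(col)) for col in zip(*seqs))

print(count_pseudo_seqs(['AA', 'AA', 'BB']))  # 2 * 2 = 4 possible pseudo seqs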