Example No. 1


        # epitope_tree_cmds.extend(cmds)
        # epitope_tree_width = max( epitope_tree_width, total_svg_width )
        # epitope_tree_height += 2200

        all_cmds[ ab ].extend( plotter.cmds )
        all_cmds[ ab ].extend( cmds )

    total_y_offset += ypad + tree_height ## increment once per epitope


for ab in all_cmds:
    ## now make the svg file
    prefix = '{}_tall_tree_{}'.format(outfile_prefix,ab)

    svg_height = total_y_offset+ypad+ymargin
    svg_basic.create_file( all_cmds[ab], total_svg_width, svg_height, prefix+'.svg', create_png=True )

    if ab == 'AB':
        util.readme( prefix+'.png',"""
        These are TCRdist clustering trees for the different repertoires, with distances calculated over both chains. Down below this in the .html output are the clustering trees for distances calculated over each chain individually. To make these trees, the repertoire is first clustered with a fixed distance threshold (a TCRdist of {:.2f} for single chain distances and {:.2f} for paired alpha+beta chain distances) using a simple greedy approach that iteratively finds the TCR with the greatest number of neighbors within that distance, adds that cluster center and its neighbors to the list of clusters and deletes them from the repertoire, repeating until all the receptors have been clustered. These clusters are the leaves of the tree, and they are joined together for visualization purposes by an average-linkage hierarchical clustering approach that uses the matrix of distances between the cluster centers. The vertical thickness of the leaves and branches is proportional to the number of TCR clones represented by those branches. Repertoires with more than 300 receptors are subsampled to 300 after clustering so that the number of leaves in the tree doesn't become too large. Trees with all the TCR clones represented can be found by following the links toward the top of the .html output.<br><br>

        TCR logos are shown to the left of the tree for a representative subset of the branches (enclosed in dashed boxes labeled with their size; logos and boxed branches come in the same vertical order). Each logo panel shows the V- (left) and J- (right) gene frequencies in 'logo' format (height scaled by frequency, most frequent gene at the top; the IMGT gene names are trimmed to remove the leading TRAV/TRAJ/TRBV/TRBJ). In the middle is a CDR3 amino acid sequence logo. The colored bars below the CDR3 logo summarize the inferred rearrangement history of the grouped receptors by showing the nucleotide source, colored as follows: V region, light gray; J region, dark gray; D region, black; N insertions, red. The number of TCRs contributing to the logo is shown to the left and should match the number next to the corresponding boxed branch of the tree.<br><br>

        This analysis is conducted at the level of clonotypes -- each expanded clone is condensed to a single receptor sequence for the purpose of clustering.

        """.format( cluster_radius['A'] / distance_scale_factor,
                    cluster_radius[ab] / distance_scale_factor ))
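
The readme text above describes a greedy fixed-radius clustering step. The following is a minimal sketch of that idea, not the pipeline's own implementation: the function name `greedy_cluster`, the use of a precomputed NumPy distance matrix, and the toy 1D points are all assumptions made for illustration.

```python
import numpy as np

def greedy_cluster(D, radius):
    """Greedy fixed-radius clustering on a symmetric distance matrix D.

    Repeatedly picks the unclustered item with the most unclustered
    neighbors within `radius`, records it and its neighbors as a cluster,
    and removes them. Returns a list of (center_index, member_indices).
    """
    D = np.asarray(D, dtype=float)
    remaining = np.ones(D.shape[0], dtype=bool)    # items not yet clustered
    clusters = []
    while remaining.any():
        # neighbor relation restricted to items that are still unclustered
        within = (D <= radius) & remaining[np.newaxis, :]
        counts = within.sum(axis=1)
        counts[~remaining] = -1                    # clustered items cannot be centers
        center = int(np.argmax(counts))
        members = np.flatnonzero(within[center])   # includes the center itself
        clusters.append((center, members.tolist()))
        remaining[members] = False
    return clusters

# Toy example (hypothetical data): five points on a line.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
D = np.abs(points[:, None] - points[None, :])
clusters = greedy_cluster(D, radius=1.5)
```

The cluster centers returned here would then be the leaves that the average-linkage step joins into a tree; that step is not sketched.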
Example No. 2
plt.subplots_adjust(left=0.15,
                    bottom=0.15,
                    top=0.925,
                    right=0.925,
                    hspace=0.02,
                    wspace=0.02)
pngfile = '{}_epitope_correlations_{}.png'.format(clones_file[:-4],
                                                  nbrdist_percentile)
print 'making:', pngfile
plt.savefig(pngfile)
util.readme(
    pngfile,
    """These heat maps show correlations between nbrdist scores for different repertoire datasets.
The idea is that, given one reference epitope-specific repertoire, we can assign a number to any TCR in any of the
different epitope-specific repertoires which is the average distance
to the nearest TCRs in the reference repertoire. Two different repertoires give us two different sets of numbers that we
can compare
over the entire merged set of TCRs and ask how well they correlate. Two very similar repertoires should give similar
nbrdist scores for most TCRs in the big dataset, whereas very different repertoires will have uncorrelated nbrdist
scores. What's plotted is just the linear Pearson correlation coefficient between the nbrdist scores assigned by
each pair of reference repertoires, colored from blue (values less than 0.4) to dark red (1.0 -- perfect correlation).
""")

A = np.zeros((len(epitopes), len(epitopes)))
D = np.zeros((len(epitopes), len(epitopes)))
Log('analyzing epitope-epitope mutual nbrdist scores')
for i_ep1, ep1 in enumerate(epitopes):
    for i_ep2, ep2 in enumerate(epitopes):
        if i_ep2 < i_ep1: continue
        for chains in ['A', 'B', 'AB']:
            dists = []
            ns = []
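
The heat-map description in this example can be illustrated with a small sketch. It assumes that an nbrdist score is the mean distance from a TCR to the closest `percentile` percent of a reference repertoire, and that the distances are available as NumPy arrays; the function names are hypothetical, and the sketch ignores details such as excluding a TCR's distance to itself when it belongs to the reference repertoire.

```python
import numpy as np

def nbrdist_scores(D_to_ref, percentile=10):
    """Mean distance from each query TCR (rows) to its nearest neighbors
    in one reference repertoire (columns)."""
    D_to_ref = np.asarray(D_to_ref, dtype=float)
    # number of reference TCRs that count as "nearest" (at least one)
    n_nbrs = max(1, (D_to_ref.shape[1] * percentile) // 100)
    nearest = np.sort(D_to_ref, axis=1)[:, :n_nbrs]
    return nearest.mean(axis=1)

def nbrdist_correlation(D_to_ref1, D_to_ref2, percentile=10):
    """Pearson correlation between the scores assigned by two references."""
    s1 = nbrdist_scores(D_to_ref1, percentile)
    s2 = nbrdist_scores(D_to_ref2, percentile)
    return np.corrcoef(s1, s2)[0, 1]

# Toy example (hypothetical data): 3 query TCRs, 4 reference TCRs each.
D1 = np.array([[1.0, 2.0, 3.0, 4.0],
               [2.0, 3.0, 4.0, 5.0],
               [5.0, 6.0, 7.0, 8.0]])
D2 = 2.0 * D1                      # a second reference with scaled distances
scores1 = nbrdist_scores(D1, percentile=50)
r = nbrdist_correlation(D1, D2, percentile=50)
```

Because the Pearson coefficient is invariant to linear rescaling, the two toy references correlate perfectly even though their absolute distances differ.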
Example No. 3
plt.figure(1, figsize=(fig_width, fig_height))

plt.subplots_adjust(hspace=hspace,
                    wspace=wspace,
                    left=left_margin,
                    right=right_margin,
                    bottom=bottom_margin,
                    top=top_margin)
#plt.suptitle('epitope={}   2D kernel-PCA projection'.format(epitope),size='large')

pngfile = pngfile_prefix + '_kpca.png'
print 'making', pngfile
plt.savefig(pngfile, dpi=paper_figs_dpi)

util.readme(
    pngfile, """
Kernel Principal Component Analysis (kPCA) 2D projection plots for the repertoires. Each row is a repertoire,
and each point in the plots corresponds to a single TCR clone, with the points arranged so as to keep nearby
TCRs (as measured by TCRdist) nearby in 2 dimensions. The four different panels are the same 2D projection
colored by gene usage for the four different segments (left to right: Va,Ja,Vb,Jb)"""
)

if show:
    plt.show()

print "kPCA results:"
print "clone.id" + "\t" + "epitope" + "\t" + "XS" + "\t" + "YS" + "\t" + "Color" + "\t" + "\t".join(
    [("kPC" + str(i)) for i in range(jcmaxlen - 5)])
for l in kPCAset:
    print l
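
The readme text in this example describes a kernel PCA projection driven by TCRdist distances. The sketch below shows one way such a 2D projection could be computed from a distance matrix. The Gaussian kernel, the median-distance bandwidth, and the name `kpca_2d` are assumptions for illustration; the snippet above does not show which kernel the pipeline uses.

```python
import numpy as np

def kpca_2d(D, sigma=None):
    """Project items into 2D with kernel PCA, starting from a distance matrix."""
    D = np.asarray(D, dtype=float)
    if sigma is None:
        sigma = np.median(D[D > 0])               # assumed bandwidth heuristic
    K = np.exp(-(D ** 2) / (2.0 * sigma ** 2))    # Gaussian kernel (an assumption)
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    Kc = J.dot(K).dot(J)                          # kernel centered in feature space
    evals, evecs = np.linalg.eigh(Kc)             # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:2]             # indices of the two largest
    # coordinates are eigenvectors scaled by the square root of their eigenvalues
    return evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))

# Toy example (hypothetical data): two well-separated groups on a line.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(points[:, None] - points[None, :])
xy = kpca_2d(D)
```

In this toy case the first coordinate has one sign for the first three points and the opposite sign for the last three, which is the "nearby stays nearby" behavior the readme text refers to.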
Example No. 4
        sum(l_others) / len(l_others), nbrdist_tag_suffix))

outlog.close()

for ii_roc in range(2):
    for ii_chains, chains in enumerate(ABs):
        figno = 2 * ii_chains + ii_roc + 1
        figtype = 'roc' if ii_roc else 'nbrdists'
        pngfile = '{}_{}_{}.png'.format(outfile_prefix, figtype, chains)
        print 'making:', pngfile
        plt.figure(figno)
        plt.savefig(pngfile)
        if ii_roc == 0:
            util.readme(
                pngfile,
                """KDE-smoothed histograms of different {} nbrdist measures, comparing each epitope-specific repertoire (red)
            to TCRs from the other repertoires (green) and to random TCRs (blue).<br><br>
            """.format(chains))
        else:
            util.readme(
                pngfile,
                """ROC curves showing true- (y-axis) versus false-positives (x-axis) as the sorting metric increases. Legends give the area under the curve (AUROC) values, ranging from 100 (perfect discrimination, curve goes straight
            up then straight across), to 50 (random), to 0 (total failure, all negatives come before all positives, curve goes straight across then straight up).<br><br>
            """.format(chains))

figno = 7
pngfile = '{}_summary.png'.format(outfile_prefix)
plt.figure(figno)
print 'making:', pngfile
plt.savefig(pngfile)
util.readme(
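
The AUROC scale described in the ROC readme text (100 perfect, 50 random, 0 total failure) can be reproduced with a pairwise formulation. This is a sketch under the assumption that a lower score means "more likely positive", as would be the case for a neighbor-distance score; the function name `auroc_percent` is hypothetical and this is not the pipeline's own routine.

```python
def auroc_percent(pos_scores, neg_scores):
    """Area under the ROC curve on a 0-100 scale.

    Computed as the percentage of (positive, negative) pairs in which the
    positive sorts before the negative (lower score first); ties count half.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return 100.0 * wins / (len(pos_scores) * len(neg_scores))

perfect = auroc_percent([1, 2], [3, 4])   # all positives come first
failure = auroc_percent([3, 4], [1, 2])   # all negatives come first
mixed = auroc_percent([1, 4], [2, 3])     # half of the pairs are ordered correctly
```

The double loop is quadratic in the number of scores; it is written this way for clarity rather than speed.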
Example No. 5
#xstep = (right_margin-left_margin) / ( len(all_mice)-1 )

for ii, epitope in enumerate(all_epitopes):
    #name = mouse[:]
    # if name[0] == 'd' and 'Mouse' in name:
    #     name = name.replace('Mouse','_')
    plt.figtext(left_margin + xwidth / 2 + ii * xwidth,
                bottom_margin - (bottom_spacer) / fig_height,
                epitope_labels[epitope],
                rotation='vertical',
                ha='center',
                va='top',
                fontdict={'fontsize': fontsize_names})

    #plt.figtext(left_margin + xwidth/2 + ii * xwidth, 0.98, epitope, ha='center', va='top' )

pngfile = outfile_prefix + '_subject_table.png'
print 'making:', pngfile
plt.savefig(pngfile)

util.readme(
    pngfile,
    """This subject-table plot shows all the successfully parsed, paired reads, split by mouse/subject (the rows)
and epitope (the columns, labeled at the bottom). The epitope column labels include in parentheses the number of clones followed by
the total number of TCRs. Each pie shows the paired reads for a single mouse/epitope combination, with each wedge corresponding to
a clone. The size of the top clone is shown in red near the red wedge, and the total number of reads is shown below the pie in black.
""")

if show:
    plt.show()
Example No. 6
            if not paper_figs:
                plt.subplots_adjust(left=0.1 if no_mouse_labels else 0.4,
                                    top=0.92,
                                    bottom=0.08,
                                    right=0.9,
                                    wspace=0.03)
                plt.suptitle(
                    'mouse tree for {} showing gene composition of mouse repertoires'
                    .format(epitope))

            ## now save the figure
            pngfile = '{}_{}_subject_tree.png'.format(outfile_prefix, epitope)
            print 'making:', pngfile
            plt.savefig(pngfile)
            util.readme(
                pngfile, """
This is a hierarchical clustering dendrogram of the subjects for the {} repertoire. The distance between a pair of subjects is defined to be the average distance between receptors in one subject and receptors in the other. The gene frequency composition of each subject is shown in the four stacks of colored bars in the middle (to the left of the tree, which is drawn with thin blue lines). To the right of the tree is a key for the gene segment coloring schemes which also shows the frequencies of each gene in the combined repertoire.
""".format(epitope))

            plt.close()

if plotno == 0:
    Log('no multi-subject epitopes??')
    exit()

if paper_figs:  ## just the individual tree images
    exit()

plt.suptitle('mouse trees based on avg TCR-TCR distance between mice',
             x=0.5,
             y=1.0)
pngfile = '{}_subject_trees.png'.format(outfile_prefix)
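
The subject-tree readme text defines the distance between two subjects as the average distance between receptors in one subject and receptors in the other. A minimal sketch of that definition follows; it assumes a precomputed TCR-TCR distance matrix and one subject label per TCR, and the function name is hypothetical.

```python
import numpy as np

def subject_distance_matrix(D, subjects):
    """Average receptor-receptor distance between every pair of subjects.

    D: square TCR-TCR distance matrix; subjects: one subject label per TCR.
    Returns (sorted subject names, subject-by-subject distance matrix).
    """
    D = np.asarray(D, dtype=float)
    subjects = np.asarray(subjects)
    names = sorted(set(subjects.tolist()))
    S = np.zeros((len(names), len(names)))
    for i, s1 in enumerate(names):
        for j, s2 in enumerate(names):
            if i == j:
                continue                          # keep a zero diagonal
            # rows from subject s1, columns from subject s2
            block = D[subjects == s1][:, subjects == s2]
            S[i, j] = block.mean()
    return names, S

# Toy example (hypothetical data): two mice with two receptors each.
points = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(points[:, None] - points[None, :])
names, S = subject_distance_matrix(D, ['m1', 'm1', 'm2', 'm2'])
```

A matrix like `S` could then be handed to a hierarchical clustering routine to build the dendrogram; the linkage method used for the subject trees is not visible in the snippet above.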
Example No. 7
    for epitope in epitopes:
        vals = []
        for tag,fmt in header:
            if tag not in all_dats[epitope]:
                val = 'N/A'
            else:
                val = '{:{}}'.format( all_dats[epitope][tag], fmt )
            vals.append( val )
        out.write('\t'.join( vals )+'\n' )
    out.close()


## make distribution plots
from scipy.stats import gaussian_kde
pngfile = clones_file[:-4]+'_cdr3_distributions.png'
util.readme( pngfile, """
Distributions of CDR3 properties for the different epitopes""" )

colors = html_colors.get_rank_colors_no_lights(len(epitopes))

scoretag_suffixes = ['_len','_charge','_hydro1','_hydro2']

nrows = 3
ncols = len(scoretag_suffixes)

plt.figure(1,figsize=(12,12))
plotno=0
for ab in ['a','b','ab']:
    for suf in scoretag_suffixes:
        plotno += 1
        plt.subplot(nrows,ncols,plotno)
        scoretag = ab+suf
Example No. 8
        plt.xticks( [0.4,1.4], ['no\n({})'.format(totals[0]),'yes\n({})'.format(totals[1])], fontsize=8 )
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.title('P-value\n{:.3g}'.format(p_table))
        print 'overall_P {:.3g} {} vs {}'.format( p_table, '_'.join( xlabel.split()), '_'.join( ylabel.split()) )


    plt.subplots_adjust(bottom = 0.1, top=0.9, right=0.98, left=0.07, wspace=0.65 )

    filetag = '{}_nbrdist{}'.format( '_wtd' if wtd_nbrdist else '', str(nbrdist_percentile) )
    pngfile = '{}_sharing_and_clonality{}.png'.format(outfile_prefix,filetag)
    print 'making',pngfile
    plt.savefig(pngfile)
    util.readme(pngfile,"""These plots explore the relationship between clonality and sharing of TCRs across mice for the same epitope. For the purpose of
    this analysis a TCR is "clonal" if it has a clone_size of at least 2 and is "shared" if it is seen in more than one mouse (ie subject). protprob and nucprob are the
    amino acid and nucleotide (respectively) generation probabilities under a very simple model of the rearrangement process. nbrdist_rank_score is a measure of
    repertoire sampling density near a given TCR: we compute an average distance to a TCR's nearest neighbors ("{}") and then percentile this over the repertoire to
    get a normalized nbr-distance measure that goes from 0 (many nearby TCRs in the repertoire) to 100 (very few).
    """.format(rank_suffix))

    if epitopes is None:
        epitopes = sorted(all_tcrs.keys())

    plt.figure(2,figsize=(14,4*len(epitopes)))


    nrows = len(epitopes)
    ncols = 7

    plotno=0
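
The nbrdist_rank_score described in the readme text of this example can be sketched in two steps: average the distance to each TCR's nearest neighbors, then convert those averages to percentile ranks over the repertoire. The function name, the `nbr_percentile` parameter, and the tie handling (arbitrary order) are assumptions for illustration, not the pipeline's own code.

```python
import numpy as np

def nbrdist_rank_scores(D, nbr_percentile=10):
    """Percentile-ranked neighbor distance, 0 (dense region) to 100 (sparse)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    n_nbrs = max(1, ((n - 1) * nbr_percentile) // 100)
    # after sorting each row, column 0 is the self-distance (0), so skip it
    nbrdists = np.sort(D, axis=1)[:, 1:n_nbrs + 1].mean(axis=1)
    # rank of each TCR's nbrdist within the repertoire, 0 .. n-1
    ranks = np.argsort(np.argsort(nbrdists))
    return 100.0 * ranks / max(1, n - 1)

# Toy example (hypothetical data): five receptors on a line.
points = np.array([0.0, 1.0, 3.0, 7.0, 20.0])
D = np.abs(points[:, None] - points[None, :])
scores = nbrdist_rank_scores(D, nbr_percentile=50)
```

In the toy data the isolated point at 20 receives the score 100 and the point at 1, which has the closest pair of neighbors, receives 0.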
Example No. 9
        else:
            plt.title(
                "cross-reactive probability, same mice (same colors as above)")

plt.subplots_adjust(bottom=0.1, top=0.97, left=0.05, right=0.97, hspace=0.3)

pngfile = '{}.png'.format(outfile_prefix)
print 'making:', pngfile
plt.savefig(pngfile)
util.readme(
    pngfile,
    """These next five plots reflect different notions of sharing or repetition in the repertoire (ie, seeing "the same" TCR more than once).
The top two plots correspond to repetition within a single mouse (ie, clonality), and give two representations of Simpson's measure: 1-Simpson's in the top
plot and 1/Simpson's in the bottom. The top plot shows 1-Simpson's for each of the individual mice (labeled by #reads in the mouse) together with an averaged value over
the entire repertoire (red disk, mice are weighted based on number of reads).
<br>
The middle plot uses the Simpson's diversity framework to analyze sharing across mice for the same epitope. The three colors correspond to three notions of
sharing: red points-- seeing the exact same TCR (alpha or beta or both chains); blue points-- seeing two TCRs within a (small) distance of one another; green points-- a
Gaussian-smoothed version of the blue points ("TCRdiv", see explanatory text for the very first figure on this page).
<br>
The bottom two plots look at cross-reactivity: sharing of TCRs between epitopes, either between different mice (plot#4) or within the same mouse (plot#5). The colors
are the same as in the middle plot.
""")

# util.make_readme( pngfile, """

# These four plots

# """ )

## read the other diversity summary statistic
for line in open(logfile, 'r'):
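
The two clonality representations named in the readme text of this example (1-Simpson's and 1/Simpson's) and the read-weighted average over mice can be sketched as follows. The sketch assumes the plain estimator, the sum of squared clone frequencies; the snippet does not show which estimator the pipeline uses, and the function names are hypothetical.

```python
def simpson_index(clone_sizes):
    """Simpson's index: probability that two reads drawn with replacement
    come from the same clone (sum of squared clone frequencies)."""
    total = float(sum(clone_sizes))
    return sum((n / total) ** 2 for n in clone_sizes)

def weighted_mean_simpson(per_mouse_clone_sizes):
    """Average Simpson's index over mice, weighting each mouse by its reads."""
    weights = [float(sum(sizes)) for sizes in per_mouse_clone_sizes]
    values = [simpson_index(sizes) for sizes in per_mouse_clone_sizes]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Toy example (hypothetical data): one diverse mouse, one fully clonal mouse.
diverse = [1, 1, 1, 1]
clonal = [4]
s = simpson_index(diverse)
one_minus = 1.0 - s                 # the "1-Simpson's" representation
inverse = 1.0 / s                   # the "1/Simpson's" representation
avg = weighted_mean_simpson([diverse, clonal])
```

For four equally sized clones the inverse index equals 4, the number of clones, which is why 1/Simpson's is often read as an effective clone count.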