import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Note: tot_counts_norm, get_vscores, get_PCA, simulate_doublets_from_counts,
# and simulate_doublets_from_pca are helper functions defined elsewhere in
# this module.


def detect_doublets(E, doub_frac=3.0, exclude_abundant_genes=1.0, min_counts=3,
                    min_cells=5, vscore_percentile=75, precomputed_pca=[],
                    num_pc=50, k=30, genes_use=[], include_doubs_in_pca=False,
                    return_parents=False, pca_method='sparse',
                    use_approxnn=False, counts=[]):
    '''
    Identifies likely doublets by finding cells that are highly similar to
    simulated doublets (combinations of random pairs of observed cells). The
    output is a "doublet score" for each observed cell. Note that these
    scores should NOT be interpreted as probabilities, though they range
    between 0 and 1.

    After assigning scores, you should examine the histogram of doublet
    scores and overlay the scores on a 2-D/3-D visualization of the data
    (e.g., t-SNE or SPRING). When a reasonable number of doublets is present,
    I usually see a long right tail on the histogram (or a bimodal
    distribution) and co-localization (clustering) of high-scoring cells in
    the t-SNE/SPRING plot.

    There are a few complementary strategies for using the doublet scores to
    remove likely doublets:
    1) If you have clustered your data, remove entire clusters that are
       primarily comprised of cells with high doublet scores.
    2) Set a threshold based on the histogram.
    3) Use the expected doublet rate (e.g., 5%) to inform your threshold.
       Since doublets formed by two highly similar cells will not be
       detected, this rate is likely an upper bound on the number of
       detectable doublets.

    Using default settings, the following steps are run to assign doublet
    scores, starting from a single-cell counts matrix:
    1. Normalize each cell by its total counts.
    2. Find highly variable genes expressed above a minimum expression level.
    3. Run PCA using the highly variable genes.
    4. Simulate doublets by averaging the PCA coordinates of random pairs of
       cells (termed "parents"). The average is weighted by the total counts
       of the parent cells.
    5. Build a k-nearest-neighbor graph using the PCA coordinates of observed
       and simulated cells (Euclidean distance is used by default).
    6. For each observed cell, calculate the doublet score as the fraction of
       its neighbors that are simulated doublets, adjusted to account for the
       number of simulated doublets.

    The key parameters are:
    - k (number of neighbors in the knn graph): as long as this isn't too low
      or way too high, the scores generally aren't very sensitive to it. A
      reasonable place to start is ~sqrt(number of cells).
    - num_pc (number of PCs used to construct the knn graph): the optimal
      value will depend on the complexity of the dataset, but again the
      scores shouldn't be very sensitive to this parameter. If you've used
      PCA for other analyses (e.g., clustering or t-SNE), a similar value
      should work well here (or just re-use your pre-generated PCA
      coordinates).
    - doub_frac (number of doublets to simulate, as a fraction of the number
      of observed cells): simulating more doublets will only improve the
      accuracy of the detector, especially for very complex datasets, since
      you'll more closely approximate the true distribution of doublets. The
      cost, of course, is a longer compute time. I usually start with a
      doub_frac of 3-5.

    INPUTS:
    - E: counts matrix (2-D numpy array; rows=cells, columns=genes)
    - doub_frac: number of doublets to simulate, as a fraction of the number
      of cells in E. In general, the higher the better, though it will slow
      things down.
    - k: number of neighbors for the KNN graph. This will automatically be
      scaled by doub_frac. To start, try setting k=int(sqrt(number of
      cells)).
    - min_counts and min_cells: for filtering genes based on expression
      levels. To be included, a gene must be expressed at a level of at least
      min_counts in at least min_cells cells.
    - vscore_percentile: for filtering genes based on variability. V-score is
      a measure of above-Poisson noise. To be included, a gene must have a
      V-score in the top vscore_percentile percentile.
    - num_pc: number of principal components for constructing the knn graph
    - genes_use: pre-filtered list of gene (column) indices to use for PCA;
      if supplied, no additional gene filtering is performed
    - include_doubs_in_pca: use simulated doublets for PCA; if True, runs
      much slower, and it is unclear whether performance is improved

    OUTPUTS:
    - doub_score_obs: doublet scores of observed cells
    - doub_score: doublet scores of observed and simulated cells
    - doub_labels: labels for the indices of doub_score; 0 for observed
      cells, 1 for simulated doublets
    '''
    if len(counts) == 0:
        counts = np.sum(E, axis=1)

    if len(precomputed_pca) == 0:
        if include_doubs_in_pca:
            print('Simulating doublets')
            E, doub_labels, parent_ix = simulate_doublets_from_counts(
                E, doublet_frac=doub_frac)

        print('Total count normalizing')
        E = tot_counts_norm(E, exclude_dominant_frac=exclude_abundant_genes)

        if len(genes_use) == 0:
            print('Finding highly variable genes')
            Vscores, CV_eff, CV_input, gene_ix, mu_gene, FF_gene, a, b = \
                get_vscores(E)
            # Keep genes passing both the expression and variability filters
            gene_filter = (
                (np.sum(E[:, gene_ix] >= min_counts, axis=0) >= min_cells)
                & (Vscores > np.percentile(Vscores, vscore_percentile)))
            gene_filter = gene_ix[gene_filter]
        else:
            gene_filter = genes_use

        print('Using', len(gene_filter), 'genes for PCA')
        PCdat = get_PCA(E[:, gene_filter], numpc=num_pc, method=pca_method)

        if not include_doubs_in_pca:
            print('Simulating doublets')
            PCdat, doub_labels, parent_ix = simulate_doublets_from_pca(
                PCdat, counts, doublet_frac=doub_frac)
    else:
        PCdat = precomputed_pca
        print('Simulating doublets')
        PCdat, doub_labels, parent_ix = simulate_doublets_from_pca(
            PCdat, counts, doublet_frac=doub_frac)

    n_obs = np.sum(doub_labels == 0)
    n_sim = np.sum(doub_labels == 1)

    # Scale up k so that the expected number of observed-cell neighbors stays
    # close to the user-specified k after adding the simulated doublets
    k_detect = int(round(k * (1 + n_sim / float(n_obs))))
    print('Running KNN classifier with k = %i' % k_detect)

    if use_approxnn:
        try:
            from annoy import AnnoyIndex
        except ImportError:
            use_approxnn = False
            print('Could not find library "annoy" for approx. nearest '
                  'neighbor search.')
            print('Using sklearn instead.')

    if use_approxnn:
        print('Using approximate nearest neighbor search.')
        # Approximate KNN using Annoy
        npc = PCdat.shape[1]
        ncell = PCdat.shape[0]
        model = AnnoyIndex(npc, metric='euclidean')

        t0 = time.time()
        for i in range(ncell):
            model.add_item(i, list(PCdat[i, :]))
        t1 = time.time() - t0
        print('Annoy: cells added %.5f sec' % t1)

        t0 = time.time()
        model.build(10)  # 10 trees
        t1 = time.time() - t0
        print('Annoy: index built %.5f sec' % t1)

        t0 = time.time()
        neighbors = []
        for iCell in range(ncell):
            # Query k_detect + 1 neighbors and drop the first (the cell itself)
            neighbors.append(model.get_nns_by_item(iCell, k_detect + 1)[1:])
        neighbors = np.array(neighbors, dtype=int)
        t1 = time.time() - t0
        print('Annoy: KNN built %.5f sec' % t1)
    else:
        t0 = time.time()
        model = KNeighborsClassifier(n_neighbors=k_detect, metric='euclidean')
        model.fit(PCdat, doub_labels)
        # With no query points, kneighbors() returns each training point's
        # neighbors, excluding the point itself
        neighbors = model.kneighbors(return_distance=False)
        t1 = time.time() - t0
        print('KNN built %.5f sec' % t1)

    # Doublet score: fraction of each cell's neighbors that are simulated
    # doublets, with observed-cell neighbors up-weighted by n_sim / n_obs to
    # correct for simulated doublets outnumbering observed cells in the graph
    n_doub_neigh = np.sum(doub_labels[neighbors] == 1, axis=1)
    n_sing_neigh = np.sum(doub_labels[neighbors] == 0, axis=1)
    doub_score = n_doub_neigh / (
        n_doub_neigh + n_sing_neigh * n_sim / float(n_obs))
    doub_score_obs = doub_score[doub_labels == 0]

    if return_parents:
        print('Aggregating parents of doublets (this can be slow)')
        doub_neigh_parents = []
        for ii in range(n_obs):
            # Reuse the neighbor lists computed above (observed cells occupy
            # the first n_obs rows); this also works when the Annoy index,
            # which has no kneighbors() method, was used
            iineigh = neighbors[ii, :]
            doubix = iineigh[doub_labels[iineigh] == 1]
            if len(doubix) > 0:
                doub_neigh_parents.append(parent_ix[doubix - n_obs, :])
            else:
                doub_neigh_parents.append([])
        print('Done')
        return doub_score_obs, doub_score, doub_labels, doub_neigh_parents

    print('Done')
    return doub_score_obs, doub_score, doub_labels
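# Worked example of the k scaling and the score adjustment above (the numbers
# are illustrative only): with n_obs = 1000 observed cells and doub_frac = 3.0,
# there are n_sim = 3000 simulated doublets, so a requested k = 30 becomes
#     k_detect = round(30 * (1 + 3000 / 1000)) = 120.
# If 60 of a cell's 120 neighbors are simulated doublets, its adjusted score is
#     60 / (60 + 60 * 3000 / 1000) = 0.25,
# i.e., the raw doublet-neighbor fraction of 0.5 is deflated because simulated
# doublets outnumber observed cells 3:1.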
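# One possible implementation of strategy 3 from the docstring. This helper is
# NOT part of the original interface, and the 5% default rate is purely
# illustrative; use the rate appropriate for your platform and cell loading
# density. It treats the expected doublet rate as an upper bound and flags at
# most roughly that fraction of cells, taking the highest-scoring ones.
def flag_doublets_by_expected_rate(doub_score_obs, expected_doublet_rate=0.05):
    # Score cutoff such that ~expected_doublet_rate of cells fall at or above it
    thresh = np.percentile(doub_score_obs,
                           100.0 * (1.0 - expected_doublet_rate))
    doublet_mask = doub_score_obs >= thresh
    return doublet_mask, thresh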
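# Minimal usage sketch, assuming this module's helper functions are available.
# The Poisson counts matrix is a synthetic stand-in (it contains no real
# doublets, so the scores here are meaningless); k follows the
# sqrt(number of cells) rule of thumb from the docstring.
if __name__ == '__main__':
    n_cells, n_genes = 2000, 5000
    E_demo = np.random.poisson(0.5, size=(n_cells, n_genes)).astype(float)

    doub_score_obs, doub_score, doub_labels = detect_doublets(
        E_demo, doub_frac=3.0, k=int(np.sqrt(n_cells)))

    doublet_mask, thresh = flag_doublets_by_expected_rate(
        doub_score_obs, expected_doublet_rate=0.05)
    print('Flagged %i of %i cells at score threshold %.3f'
          % (doublet_mask.sum(), n_cells, thresh))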