
Please note: we are in the process of improving the documentation and cleanliness of the code herein. If you find something missing, please contact one of the committers (Brant Faircloth or Nick Crawford).

This repository holds computer code (and, temporarily, data) used as part of McCormack et al. XXXX.

[CITE]

There are a large number of independent and interdependent programs within this repository. The fruits of these programs' labor are found in the Downloads, as described below. Since many people are mostly interested in these data, we describe them first.

There is more documentation, including details on workflow, the purpose of specific programs, etc. in the documentation:

data

We make a number of data files available (at present) on this site. In essence, these data are the fruits of our labor. The data files are as follows.

uce_probe.sqlite.bz2

This is an sqlite database containing the data that are central to our probe design process. The tables are:

cons - this table contains matches that we identified in alignments (MAF) between chicken and lizard using the initial version (0.1) of the genome/summary.py code in this repository (see "tags" below).
blast - this table contains results from a blast of the matches from cons onto the zebra finch genome. For these matches, we used the 0.1 tagged genome/summaryBlast.py code.
gallus_refseq - this is a table of the refseq genes in galGal3. It would likely be best to download a recent version of these data from NCBI or UCSC rather than using what is here.
probes - this is a table of the probes that we designed from the UCEs we located in chicken, lizard, and zebra finch. We designed these probes using `design/sure_select_tiler.py`.
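
As a minimal sketch of how one might inspect this database after downloading it, the following Python decompresses the archive and lists the tables it contains. The local file names and the helper names (`decompress_bz2`, `list_tables`) are assumptions for illustration; they are not part of this repository.

```python
import bz2
import shutil
import sqlite3


def decompress_bz2(src, dest):
    """Decompress a .bz2 file (src) and write the result to dest."""
    with bz2.open(src, "rb") as infile, open(dest, "wb") as outfile:
        shutil.copyfileobj(infile, outfile)


def list_tables(db_path):
    """Return the sorted names of all tables in an sqlite database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


# Hypothetical usage, assuming the download sits in the working directory:
# decompress_bz2("uce_probe.sqlite.bz2", "uce_probe.sqlite")
# print(list_tables("uce_probe.sqlite"))
```

Listing the tables first is a cheap way to confirm that the download and decompression worked before running any queries.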

uce_bed.bz2

This is a BED format file of the UCEs we located that are shared between chicken, lizard, and zebra finch. The locations within this file are relative to chicken (galGal3).

uce_probes.bed.bz2

This is a BED format file of the locations of the probes we designed. The locations within this file are relative to chicken (galGal3).
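
The BED files can be read without decompressing them to disk first. The sketch below assumes only the three standard leading BED columns (chromosome, start, end); we make no claim here about which additional columns the files carry, and `read_bed` is a hypothetical helper, not a program in this repository.

```python
import bz2


def read_bed(path):
    """Yield (chrom, start, end) tuples from a bzip2-compressed BED file.

    BED coordinates are 0-based and half-open, so the length of an
    interval is simply end - start.
    """
    with bz2.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and optional header/comment lines
            if not line or line.startswith(("track", "browser", "#")):
                continue
            fields = line.split()
            yield fields[0], int(fields[1]), int(fields[2])


# Hypothetical usage, assuming the download sits in the working directory:
# for chrom, start, end in read_bed("uce_probes.bed.bz2"):
#     print(chrom, start, end, end - start)
```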

probe_matches_to_other_genomes.sql.bz2

This is a bzipped dump file of the mysql database in which we stored our alignments of each probe sequence to other genomes. Initially, and for error-checking purposes, we conducted the align, clean, and insert steps by hand (as seen in Simulation/STEPS.md). This process has now largely been automated in Future/run_mutiple_lastz.py. The tables are:

organizational/metadata tables

all_orgs - this table contains data from all of the matches to individual species listed below
cons - this table is identical to the cons table from uce_probe.sqlite. It is repeated here for ease of use
group_XAB_5 - this contains the list of probes found in all members of [monDom5, loxAfr3, choHof1, hg19, mm9]
group_Bats_5 - this contains the list of probes found in all members of [eriEur1, equCab2, bosTau4, canFam2, pteVam1]
group_Bats_7 - this contains the list of probes found in all members of [eriEur1, equCab2, bosTau4, canFam2, pteVam1, myoLuc1, ailMel1]
group_Bats_8 - this contains the list of probes found in all members of [eriEur1, pteVam1, myoLuc1, bosTau4, vicPac1, equCab2, canFam2, ailMel1]
group_Bats_12 - this contains the list of probes found in all members of [eriEur1, pteVam1, myoLuc1, bosTau4, vicPac1, susScr9, oviAri1, turTru1, equCab2, canFam2, felCat3, ailMel1]
group_elephants_7 - this contains the list of probes found in all members of [loxAfr3, canFam2, echTel1, choHof1, dasNov2, hg19, monDom5]
group_size_19 - this contains the list of probes found in all members of [anoCar2, bosTau4, calJac3, canFam2, cavPor3, chinese, equCab2, gorGor3, hg19, korean, loxAfr3, mm9, monDom5, oryCun2, panTro2, ponAbe2, rheMac2, taeGut1, venter]
group_size_23 - this contains the list of probes found in all members of [bosTau4, canFam2, cavPor3, chinese, dasNov2, echTel1, equCab2, gorGor3, hg19, korean, loxAfr3, mm9, monDom5, ochPri2, oryCun2, panTro2, ponAbe2, pteVam1, rn4, speTri1, tarSyr1, tupBel1, venter]
group_size_25 - this contains the list of probes found in all members of [anoCar2, bosTau4, calJac3, canFam2, cavPor3, chinese, dipOrd1, equCab2, gorGor3, hg19, korean, loxAfr3, mm9, monDom5, ornAna1, oryCun2, panTro2, ponAbe2, pteVam1, rheMac2, rn4, taeGut1, tarSyr1, venter, vicPac1]
group_size_29 - this contains the list of probes found in all members of [anoCar2, bosTau4, calJac3, canFam2, cavPor3, chinese, choHof1, dipOrd1, echTel1, equCab2, eriEur1, gorGor3, hg19, korean, loxAfr3, mm9, monDom5, ornAna1, oryCun2, panTro2, ponAbe2, pteVam1, rheMac2, rn4, taeGut1, tarSyr1, tupBel1, venter, vicPac1]
probe_distribution - this contains a binary "matrix" indicating presence/absence (1/0) of probe matches by species
probes - this is a table providing the ids of the probes we designed
species - this table provides information on the genome build of each organism to which we aligned probes
sureselect - this table is identical to the probes table from uce_probe.sqlite
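
The relationship between probe_distribution and the group_* tables can be sketched in a few lines of Python: given presence/absence (1/0) values per probe and species, a group holds the probes that are present in every member. This only illustrates the idea. It uses a hypothetical in-memory dictionary with made-up probe ids and values, not the actual table layout or the code we used to build the tables.

```python
def probes_in_all(distribution, members):
    """Return sorted probe ids whose value is 1 for every species in members.

    distribution maps probe id -> {species: 1 or 0}; a species missing
    from a probe's row is treated as absent (0).
    """
    return sorted(
        probe
        for probe, row in distribution.items()
        if all(row.get(species, 0) == 1 for species in members)
    )


# Made-up example data (not taken from the real database)
example = {
    "probe_1": {"hg19": 1, "mm9": 1, "monDom5": 1},
    "probe_2": {"hg19": 1, "mm9": 0, "monDom5": 1},
    "probe_3": {"hg19": 1, "mm9": 1, "monDom5": 0},
}
print(probes_in_all(example, ["hg19", "mm9"]))  # ['probe_1', 'probe_3']
```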

lastz matches of probes to individual species (build version, name, etc. in species above)

ailMel1
anoCar2
bosTau4
calJac3
canFam2
cavPor3
chinese
choHof1
danRer6
dasNov2
dipOrd1
echTel1
equCab2
eriEur1
felCat3
gasAcu1
gorGor3
hg19
korean
loxAfr3
macEug1
micMur1
mm9
monDom5
myoLuc1
ochPri2
ornAna1
oryCun2
otoGar1
oviAri1
panTro2
ponAbe2
proCap1
pteVam1
rheMac2
rn4
sorAra1
speTri1
susScr9
taeGut1
tarSyr1
tetNig1
tupBel1
turTru1
venter
vicPac1
xenTro2

code

The code is available at http://github.com/BadDNA/seqcap/. This file is the top-level README.

You will notice, likely at first glance, that some of the code seems all over the place (style-wise). That's because, to some degree, it is. We began this project during 2008, and it has stretched to the present.

There are several programs that were use-once-and-forget, and others that became indispensable or are newer/prettier/better/etc. I would say, too, that you can see some evolution in the code itself. About 3/4 of the way into the project, I (BCF) also started to better follow PEP8 which you'll also likely notice. Nick Crawford (NGC) was better about following PEP8 than I.

You will also notice, should you scrutinize the code, that some programs write to an sqlite database while others write to a mysql database. The reasons for this additional level of complexity are several-fold. Generally speaking, we started using sqlite as the initial database for holding data generated as part of this project, but we moved to mysql when demands for concurrency required that we use a database supporting concurrent writes (sqlite does not).
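
To make the concurrency point concrete, here is a small sketch showing that sqlite allows only one writer at a time: while one connection holds an open write transaction, a second connection's write fails. This is not code from this repository; the table name `hits` and the function name are hypothetical.

```python
import os
import sqlite3
import tempfile


def second_writer_blocked(db_path):
    """Return True if a second connection cannot write while the first
    connection holds an open write transaction."""
    # isolation_level=None lets us issue BEGIN/ROLLBACK explicitly
    first = sqlite3.connect(db_path, isolation_level=None)
    # timeout=0 makes the second connection fail at once instead of waiting
    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS hits (probe TEXT)")
        first.execute("BEGIN IMMEDIATE")  # take the write lock
        first.execute("INSERT INTO hits VALUES ('a')")
        try:
            second.execute("INSERT INTO hits VALUES ('b')")
        except sqlite3.OperationalError:
            return True  # typically "database is locked"
        return False
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()


path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
print(second_writer_blocked(path))
```

With a longer timeout the second writer would wait for the lock rather than fail immediately, which serializes writes rather than running them concurrently.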

Three additional notes:

We have moved the code within this repository here from a private repository that I (BCF) maintain for the development portions of this project. You should generally be happy about this, because it has allowed us to do a fair amount of housekeeping. If you believe a program is missing that may be in this private repository, please let me know, and I'll attempt to move it over.
We have an updated workflow for a number of the steps detailed below, particularly the initial steps of UCE location and probe design. When the time comes, we will tag pertinent files in the current repo, and then move in the new bits.
Some of the methods/code within are likely confusing to others, particularly if you are trying to piece together what we did without actually reading the code. For the most part, we'll try to give you some guidance, but you'll also need to read the code. It may be helpful to enlist someone with knowledge of Python to aid this process.

Acknowledgments

We thank the UCSC genome browser, in particular, for being an awesome resource that enables much of the work within. We also thank all of the organizations that have made genomic sequences available for the many organisms we've used as part of this and other work. Lastly, we should thank GitHub for easing what otherwise would have been a complicated collaboration.
