The assembler module depends only on Python 2.7.
Suggested packages, which are required for the visualization module: graphviz, seaborn.
python Assemble.py [-h] --fragments FRAGMENTS [-o OUTPUT]
Assembles a chromosomal sequence from a FASTA text file containing fragments, using a de Bruijn graph assembler adapted from teaching materials of Dr. Ben Langmead.
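As an illustration of the input format, here is a minimal sketch of reading fragments from FASTA lines. The function name `parse_fasta` is hypothetical and is not taken from `Assemble.py`; it assumes the usual FASTA layout of a `>` header line followed by one or more sequence lines.

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; not the parser used by Assemble.py.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

With a file on disk, this could be called as `parse_fasta(open(path))`, where `path` stands for whatever is passed to `--fragments`.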
- Make de Bruijn graph by chopping fragments into k-mers
- Make Eulerian by:
  - Collapsing multi-edges into one by only considering each k-mer once
- Find Eulerian path and do Eulerian walk to get assembly prediction
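The steps above can be sketched roughly as follows. This is an illustrative sketch, not the code in `Assemble.py`: the function names are hypothetical, the value of `k` is left to the caller, and it assumes the common convention in which nodes are (k-1)-mers and each distinct k-mer contributes one edge.

```python
from collections import defaultdict


def de_bruijn_graph(fragments, k):
    """Map each (k-1)-mer to its successors; each distinct k-mer adds one edge."""
    seen = set()
    graph = defaultdict(list)
    for frag in fragments:
        for i in range(len(frag) - k + 1):
            kmer = frag[i:i + k]
            if kmer in seen:
                continue  # collapse multi-edges: consider each k-mer once
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_walk(graph):
    """Return the nodes of an Eulerian path, or raise ValueError if none exists."""
    in_deg = defaultdict(int)
    for src in graph:
        for dst in graph[src]:
            in_deg[dst] += 1
    nodes = set(graph) | set(in_deg)
    if not nodes:
        raise ValueError("graph has no edges")
    out_deg = dict((n, len(graph.get(n, []))) for n in nodes)
    starts = [n for n in nodes if out_deg[n] - in_deg[n] == 1]
    ends = [n for n in nodes if in_deg[n] - out_deg[n] == 1]
    balanced = [n for n in nodes if in_deg[n] == out_deg[n]]
    # At most one start and one end; every other node must be balanced.
    if len(starts) > 1 or len(starts) + len(ends) + len(balanced) != len(nodes):
        raise ValueError("graph is not Eulerian")

    # Hierholzer's algorithm on a copy of the adjacency lists.
    remaining = dict((n, list(graph.get(n, []))) for n in nodes)
    start = starts[0] if starts else next(n for n in sorted(nodes) if remaining[n])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining[node]:
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != sum(out_deg.values()) + 1:
        raise ValueError("graph is not connected")
    return path


def assemble(fragments, k):
    """Spell out the sequence along the Eulerian walk."""
    walk = eulerian_walk(de_bruijn_graph(fragments, k))
    return walk[0] + "".join(node[-1] for node in walk[1:])
```

For example, `assemble(["ATGGCG", "GCGTAC", "GTACA"], 4)` returns `"ATGGCGTACA"`. When the degree conditions fail, the sketch raises `ValueError`, which mirrors the "assembler complains" behaviour only loosely.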
It may be impossible to make the graph Eulerian, at which point the assembler complains. This would generally happen if there were errors in the reads; I believe it is prevented by the at-least-half overlap assumption (see below) and the setting of k. If I had more time, I would attempt a more formal proof of the above, and if I failed to do so, I would:
- Remove transitively-inferable edges in the de Bruijn graph (edges skipping one or more nodes), if these were the cause of the non-Eulerian nature
- Fragments overlap (share sequence) with at least one other fragment
- Sharing region is ≥ ½ the length of each fragment
- ∃ unique way to reconstruct entire chromosome from input sequences by aligning reads
- Fragments of length ≤ 1000
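The overlap and length assumptions listed above can be checked on a set of fragments with a sketch like the one below. The function names are hypothetical, and the sketch reads "≥ ½ the length of each fragment" as "at least half the length of both fragments in the pair", which is one interpretation. It only considers suffix-prefix overlaps, not one fragment contained inside another, and it does not test uniqueness of the reconstruction.

```python
def overlap_length(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def satisfies_assumptions(fragments, max_len=1000):
    """Check the length bound and the at-least-half overlap assumption.

    Hypothetical helper: every fragment must be at most max_len long and must
    share, with some other fragment, an overlap of at least half of both lengths.
    """
    if any(len(f) > max_len for f in fragments):
        return False
    for i, a in enumerate(fragments):
        has_partner = False
        for j, b in enumerate(fragments):
            if i == j:
                continue
            # Try both orientations: a before b, and b before a.
            shared = max(overlap_length(a, b), overlap_length(b, a))
            if 2 * shared >= len(a) and 2 * shared >= len(b):
                has_partner = True
                break
        if not has_partner:
            return False
    return True
```

This is quadratic in the number of fragments, which is acceptable for a sanity check on small inputs but not a substitute for an indexed overlap search.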
See https://github.com/ijoseph/ChromAssembler/blob/master/Visualizations.ipynb for output and visualizations