A Python utility for evenly downsampling polymorphisms from a population of sequences.
- Start with a sequence alignment.
- Collate all positions that show polymorphisms, i.e. positions that are not 100% conserved across the alignment.
- Randomly pick one position.
- At that position, randomly pick one of the polymorphisms.
- Keep only the sequences that carry that polymorphism at that position.
- Randomly pick one of the remaining sequences.
- Figure out which other polymorphisms are covered by that sequence, and remove them from consideration.
- Add the chosen sequence to a collated set, and remove it from further consideration.
- Repeat until:
  - No more polymorphisms need to be found.
  - No more sequences are available.
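
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the utility's actual code: the function name `downsample`, the `seed` parameter, and the choice to represent the alignment as a dict of name to equal-length string are all assumptions made for the example. It also assumes a gap character counts as just another residue.

```python
import random


def downsample(alignment, seed=None):
    """Pick sequences until every polymorphism is covered (hypothetical sketch).

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Returns the list of chosen sequence names, in the order they were picked.
    """
    if not alignment:
        return []
    rng = random.Random(seed)
    names = sorted(alignment)
    length = len(alignment[names[0]])

    # Collate polymorphic positions: position -> residues still to be covered.
    uncovered = {}
    for pos in range(length):
        residues = {alignment[name][pos] for name in names}
        if len(residues) > 1:  # not 100% conserved
            uncovered[pos] = residues

    available = set(names)
    chosen = []
    while uncovered and available:
        # Randomly pick a position, then one polymorphism at that position.
        pos = rng.choice(sorted(uncovered))
        residue = rng.choice(sorted(uncovered[pos]))
        # Keep only the sequences that carry that polymorphism there.
        candidates = sorted(n for n in available if alignment[n][pos] == residue)
        pick = rng.choice(candidates)
        # Remove every polymorphism the picked sequence covers.
        for p in list(uncovered):
            uncovered[p].discard(alignment[pick][p])
            if not uncovered[p]:
                del uncovered[p]
        chosen.append(pick)
        available.remove(pick)
    return chosen


if __name__ == "__main__":
    example = {"s1": "AAAA", "s2": "AAAT", "s3": "ACAT", "s4": "ACAA"}
    print(downsample(example, seed=0))
```

An uncovered polymorphism is always carried by at least one sequence that has not yet been picked, so `candidates` is never empty inside the loop. Sorting before each `rng.choice` call keeps the result reproducible for a given seed.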