
Shuffle Pads

This function realigns all of the sequences within a contig to improve pad placement. It replaces the old Shuffle Pads command within the contig editor; being outside the editor allows it to be scripted automatically. The contigs to realign may be specified as a single contig, all contigs, or a set of contig names read from a file or a gap4 list. Currently the entire contig is shuffled, which can take some time on large contigs. In future we plan to allow regions to be specified.

Padding (gapping) problems originate in many sequence assembly algorithms, including gap4's, where sequences are aligned against a consensus rather than a profile. As an example let us consider aligning TCAAGAC (Sequence4) to the following contig:

Sequence1:    GATTCAAAGAC
Sequence2:      TTCAA*GACGG
Sequence3:        CAAAGACGGATC

Consensus:    GATTCAAAGACGGATC

The consensus contains a triple A because that is the most likely sequence. However, there are three possible ways to align a sequence containing a double A:

alignment1:      TCAA*GAC
alignment2:      TCA*AGAC
alignment3:      TC*AAGAC
Consensus:    GATTCAAAGACGGATC

All of these have identical alignment scores because the cost of inserting a gap into the sequence is the same at every position. Alignment algorithms typically place pads consistently at one end (i.e. the left end or the right end), but after contigs have been complemented and more data inserted this often yields pads at both ends, as follows:

Sequence1:    GATTCAAAGAC
Sequence2:      TTCAA*GACGG
Sequence3:        CAAAGACGGATC
Sequence4:       TC*AAGAC
Consensus:    GATTCAAAGACGGATC
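
The tie between the three candidate placements of Sequence4 can be checked with a small sketch. It assumes a deliberately simple cost (one unit for every column where the padded read differs from the consensus base); the scoring gap4 actually uses is not given here, and the names consensus_cost and OFFSET are illustrative only.

```python
CONSENSUS = "GATTCAAAGACGGATC"
OFFSET = 3  # Sequence4 starts under the second T of the consensus (0-based)


def consensus_cost(padded_read, consensus=CONSENSUS, offset=OFFSET):
    """Count columns where the padded read differs from the consensus base.

    Assumed toy cost: a pad opposite a base costs the same as a mismatch.
    """
    window = consensus[offset:offset + len(padded_read)]
    return sum(1 for r, c in zip(padded_read, window) if r != c)


# The three ways of padding TCAAGAC against the triple A.
placements = ["TCAA*GAC", "TCA*AGAC", "TC*AAGAC"]
consensus_costs = [consensus_cost(p) for p in placements]
print(consensus_costs)  # [1, 1, 1]
```

Every placement costs exactly one pad against the consensus, so an aligner that sees only the consensus has no reason to prefer one over another.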

The new Shuffle Pads algorithm implements the ideas put forward by Anson and Myers in ReAligner. It aligns each sequence against a consensus vector in which the entire column of bases at each consensus position is used to compute match, mismatch and indel scores. The result is that pads generally get shuffled to the same end (not necessarily always left or always right) and the total number of disagreements with the consensus is reduced.
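
The effect of scoring against whole columns rather than a single consensus base can be illustrated on the example contig. This is only a sketch under stated assumptions: it scores three fixed placements of Sequence4 instead of running a full realignment, and it uses a plain count of disagreeing symbols per column as the cost. The function names and the cost itself are hypothetical and are not taken from gap4 or ReAligner.

```python
from collections import Counter

# Reads from the example contig as (0-based start column, padded sequence).
CONTIG = [
    (0, "GATTCAAAGAC"),   # Sequence1
    (2, "TTCAA*GACGG"),   # Sequence2
    (4, "CAAAGACGGATC"),  # Sequence3
]


def build_columns(reads):
    """Return one Counter of observed symbols (bases and pads) per column."""
    width = max(start + len(seq) for start, seq in reads)
    columns = [Counter() for _ in range(width)]
    for start, seq in reads:
        for i, symbol in enumerate(seq):
            columns[start + i][symbol] += 1
    return columns


def column_cost(padded_read, start, columns):
    """Count symbols in the covered columns that disagree with the read."""
    return sum(
        sum(n for symbol, n in columns[start + i].items() if symbol != r)
        for i, r in enumerate(padded_read)
    )


columns = build_columns(CONTIG)
placements = ["TCAA*GAC", "TCA*AGAC", "TC*AAGAC"]
column_costs = {p: column_cost(p, 3, columns) for p in placements}
best = min(column_costs, key=column_costs.get)
print(column_costs)  # {'TCAA*GAC': 2, 'TCA*AGAC': 4, 'TC*AAGAC': 4}
print(best)          # TCAA*GAC
```

With column counts the tie disappears: placing the pad in the column where Sequence2 already has a pad gives the fewest disagreements, which is the behaviour described above.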

For speed we assume that the new alignment will deviate only slightly from the old one, and so a narrow "band size" is used. This parameter may be adjusted if required, but a wider band costs speed.
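
The idea of a band can be sketched with a generic banded edit distance, which fills only the dynamic programming cells within a fixed distance of the main diagonal. This is a minimal illustration assuming unit costs and a plain consensus string; it is not the Shuffle Pads implementation, and the function name and the band argument are hypothetical.

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b using only cells with |i - j| <= band.

    Returns None when the lengths differ by more than the band, since no
    alignment can then stay inside it.
    """
    inf = float("inf")
    if abs(len(a) - len(b)) > band:
        return None
    # Row 0: aligning an empty prefix of a against j symbols of b costs j.
    prev = {j: j for j in range(min(len(b), band) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            if j == 0:
                cur[j] = i  # i symbols of a against nothing
                continue
            best = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])  # (mis)match
            best = min(best, prev.get(j, inf) + 1)     # symbol of a unpaired
            best = min(best, cur.get(j - 1, inf) + 1)  # symbol of b unpaired
            cur[j] = best
        prev = cur
    return prev[len(b)]


# Sequence4 without pads against the consensus region it covers.
print(banded_edit_distance("TCAAGAC", "TCAAAGAC", 1))  # 1
print(banded_edit_distance("TCAAGAC", "TCAAAGAC", 0))  # None
```

Work grows with the read length times the band width rather than with the product of both sequence lengths, which is why a wider band is slower.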


Last generated on 25 November 2011.