The 2nd-Highest Confidence (see section 2nd-Highest Confidence) and the Diploid Graph (see section Diploid Graph both plot indicators of how likely an alignment column is to be made up of 2 or more sequence populations.
By studying these in further detail we should be able to spot correlated differences and to start assigning haplotypes. The SNP Candidate plot initially brings up a dialogue asking for a single contig and range. After selecting this a window is displayed showing the likely locations of SNPs as seen below.
The top row of this has controls to define how the 2nd-Highest
Confidence or Diploid Graph results are analysed in order to pick
candidate locations for SNPs.
Going from right to left, the "2 alleles only" toggle switches
between the two algorithms; when enabled it uses the additional
assumption coded into the Diploid Graph of their being only two
populations in approximately 50:50 ratio. Next the minimum base
quality may be adjusted. Any difference with a poorer quality than
this is completely ignored. The minimum discrepancy score is a
threshold (with high indicating a strong SNP) applied to the results
of the consistency plot results. A spike in this plot needs to be at
least as high as this score to be accepted. This score is then
adjusted for immediate proximity to other SNPs (e.g. it forms a run of
bases) and this adjusted score is compared against the minimum SNP
score parameter. Typically this can be left low. If any of these
parameters are modified press the "Recalculate candidate SNPs"
button to recompute.
The large central panel contains a vertically scrolled representation
of the candidate SNPs found. By default the left-most plot contains a
pictorial view of the sequence depth. Next to this is a vertical ruler
showing the relative positions of candidate SNPs. Both of these two
plots are to scale based on the sequence itself. To the right of these
come a series of text based items with one row per candidate
SNP. Initially this consists only of a check button ("Use"),
Position, Score and the frequency of base types observed at that
consensus column. Double clicking on any row will bring up the contig
editor at that position showing the potential SNP. You may manually
curate which ones you consider to be true or not by enabling or
disabling it use the "Use" checkbox on that row. The score may also
be manually adjusted allowing certain differences to be forced apart
by using a very high score.
The second row from the top contains a row of options controlling how
the correlation between candidate SNPs is used to assign
haplotypes. For every template in the contig the algorithm produces a
fake sequence consistencing only of the bases considered to be a
candidate SNP and enabled by having the "Use" checkbox set. These
fake sequences are then clustered to form groups. No re-alignment is
performed as the existing multiple alignment has already been made
(although you may wish to run the Shuffle Pads algorithm before hand
if the existing sequence alignment is poor).
This is a fairly standard clustering algorithm that starts with each
sequence being the sole member of a set. All sets are compared with
each other based on the correlation between sets using an adjusted
correlation score (achieved by subtracting "Correlation offset") and
then the overlaps are ranked by score. The best scoring
sets are then merged together. If Fast Mode is not being used the
merged set is then compared against everything else once more to
obtain new scores, otherwise a simple adjustment is guessed
at. Skipping this step speeds up the algorithm considerably and
generally gives sufficient results; hence the Fast Mode toggle. This
process is repeated until no two sets have an overlap score of greater
than or equal to the "Minimum merge score".
The Filter Templates button brings up a new dialogue box containing an
editable list (initially blank) of template names. Adding a template
name here will force this template to be ignored by the clustering
algorithm. You may also enter reading names here too and they will be
automatically converted to template names, hence filtering out all
other readings from the same template. If you or suspect specific
templates from being chimeric then this is where they should be listed.
The Cluster by SNPs button starts the clustering process running. It
cannot be interrupted and may take a few minutes. After completion the
"Sets" component (rightmost) of the central plot is updated as seen
in the below screenshot. Each set is a group of templates clustered
together based on the candidate SNPs. They are sorted in left to right
order such that the left-most set contains the most number of
templates and the right most set contains the fewest. The consensus
for members of that set is displayed in each square and the quality of
the consensus is shown in a similar fashion to the contig editor, with
white being good quality and dark grey being poor (usually due to
being low coverage within that set).
The background to the entire row is also shaded to indicate the
observed quality of that SNP in the context of this clustering. A
white background indicates that two or more sets exist with high
quality consensus bases (>= quality 90) that differ. A light grey
background is used where the consensus bases differ but not with high
quality bases. A dark grey background is used to indicate that the
consensus in all sets covering that SNP candidate agree. This
typically happens when either the clustering has failed or when a
candidate SNP is not a real indicator of which haplotype a sequence
belongs to, such as a base calling error or a random fluctuation in
homopolymer length. If you wish to force this SNP to be used for
clustering then try increasing its score and re-clustering again.
Hence in the above example we see two distinct good quality sets made
from the SNPs between 1503 and 2334 and two more good quality sets
from 12039 onwards. This indicates that we have no templates where one
end spans SNPs in the 1503-2334 region and the other end spans SNPs in
the 12039 onwards region. We also have a series of smaller sets which
probably arise due to incorrect base calls or more rarely due to
chimeras.
Now if we double click to get the contig editor up it will display an
additional window labelled "Tabs". NOTE: this does not happen if a
contig editor for this contig is already being displayed. If so shut
that one down first. Notice that the sequence names are also
coloured. This indicates the set the sequence has been assigned
to. The picture below also has the "Highlight Disagreements" mode
enabled with a difference quality cutoff sufficient set to match the
one used in the SNP Candidates plot. Two clear SNP positions can be
seen.
The tabs window lists the set numbers and their size (except for
"All"). Selecting a set will show just sequences from that set. This
allows for the set consensus and quality values to be viewed. The
editor also allows for sequences to be moved from one set to another,
but for now this is purely serves a visual purpose and the movements
are not passed back to the main SNP candidates window (although this
is an obvious change to make).
Moving back to the main SNP Candidates window note that we have a
series of selection buttons at the bottom of the window. These control
automatic selection of rows (SNPs) based on their quality assigned by
observing the set consensus sequences. The clustering algorithm only
works on selected sets so this allows for poor quality SNPs to be
removed from further calculations. Additionally to simplify the view
unselected SNPs may be removed by pressing the "Remove unselected" button.
Above each set has a checkbutton above it (not visible in the
screenshot). Initially these are not enabled, but they indicate which
sets certain operations should be performed on. Pressing the right
mouse button over a set (or a set checkbox) brings up a menu
indicating the following operations.
The final set of controls to discuss in the SNP Candidates window
control the splitting of sets into contigs. This is a one-way action
which cannot be undone, so make sure you backup the database using
Copy Database before hand.
The "Split sets to contigs" button moves the readings in each
selected set to its own contig. In some cases a set may be
non-contiguous. Remember that templates are assigned to sets, but a
template may often only have the end sequence known with the middle
portion being unsequenced. Gap4 does not currently handle scaffolds
and super-contigs so in order to keep such sets held together in a
single contig the "Add fake consensus" option may be used. This adds
an additional sequence to the contig that contains the consensus for
the set (including from readings that were unassigned). This also
handily means that new contigs produced from multiple sets are already
aligned and base coordinates are directly comparable. Hence two such sets may
be viewed in the Join Editor by typing their names into the main Join
Contigs dialogue. (Find Internal Joins will attempt to realign the
contigs and often fails if the set contains many regions of unknown
consensus.)
(Click for full size image)
(Click for full size image)
(Click for full size image)
Last generated on 25 November 2011.