
List Base Confidence

The various base-callers may produce a confidence value for each base call. Previous sections describe how this may be used to produce a consensus sequence along with a consensus confidence.

This function tabulates the frequency of each base confidence value, along with a count of how many times it matches or mismatches the consensus. Given that the standard scale for confidence values follows the formula -10*log10(probability of error), we can determine the expected frequency of mismatches for any particular confidence value. Comparing this with the observed frequencies gives a powerful summary of the amount of misassembled data.

Total bases considered : 45270
Problem score          : 1.337130

Conf.        Match        Mismatch           Expected      Over-
value         freq            freq               freq  representation
---------------------------------------------------------------------
  0              0               0               0.00      0.00
  1              0               0               0.00      0.00
  2              0               0               0.00      0.00
  3              0               0               0.00      0.00
  4             37              22              23.49      0.94
  5              0               0               0.00      0.00
  6             89              46              33.91      1.36
  7            119              26              28.93      0.90
  8            256              37              46.44      0.80
  9            368              30              50.11      0.60
 10            669              31              70.00      0.44
...

In the above example we see that there are 59 sequence bases with confidence 4, of which 37 match the consensus and 22 do not. A confidence of 4 corresponds to an error probability of 10^(-4/10), or about 0.398. So, if we assume that the consensus is correct, we would expect approximately 40% of these bases (23.49 of 59) to be incorrect. We have measured 37% to be incorrect (22/59), which is 0.94 times the expected amount.
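The arithmetic behind a single row of the table can be sketched in a few lines of Python. This is only an illustration of the formula described above, not the program's own code. The function names are hypothetical, and returning 0.0 for a row with no bases is an assumption made to mirror the 0.00 entries in the listing.

```python
def error_probability(conf):
    """Error probability implied by a confidence value on the -10*log10(P) scale."""
    return 10 ** (-conf / 10)


def expected_mismatches(conf, match, mismatch):
    """Expected number of mismatches among all bases with this confidence value."""
    return (match + mismatch) * error_probability(conf)


def over_representation(conf, match, mismatch):
    """Observed mismatches divided by expected mismatches."""
    expected = expected_mismatches(conf, match, mismatch)
    if expected == 0:
        return 0.0  # assumption: empty rows are reported as 0.00
    return mismatch / expected


# Row for confidence 4 in the first example: 37 matches, 22 mismatches
print(f"{expected_mismatches(4, 37, 22):.2f}")  # 23.49
print(f"{over_representation(4, 37, 22):.2f}")  # 0.94
```

The same functions reproduce the other rows shown, for example confidence 10 with 669 matches and 31 mismatches gives an expected frequency of 70.00 and an over-representation of 0.44.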

For a more problematic assembly, we may see a section of output like this:

Total bases considered : 1617511
Problem score          : 311.591358

Conf.        Match        Mismatch           Expected      Over-
value         freq            freq               freq  representation
---------------------------------------------------------------------
...
 20          13432             384             138.16      2.78
 21          23384             851             192.51      4.42
 22          18763             487             121.46      4.01
 23          13712             300              70.23      4.27
 24          21182             363              85.77      4.23
 25          20466             218              65.41      3.33
 26           9752             123              24.80      4.96
 27          23071             282              46.60      6.05
 28          13816             158              22.15      7.13
 29          27514             166              34.85      4.76
 30          15664             140              15.80      8.86
...

We can see here that the observed mismatch frequency is far greater than the expected frequency. This indicates the number of misassemblies (or SNPs in the case of mixed samples) within this project and is reflected in the combined "Problem score". This score is simply the sum of the final column, using the reciprocal (1 over the value) for entries less than 1.0.
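As a rough sketch, the summation described above could be written as follows in Python. It follows the prose description literally. The program's actual calculation may include details not stated here, so the result need not match the printed "Problem score". The function name `problem_score` is hypothetical, and skipping entries of zero (which have no reciprocal) is an assumption.

```python
def problem_score(ratios):
    """Sum the over-representation values, using 1/value for entries below 1.0."""
    score = 0.0
    for ratio in ratios:
        if ratio <= 0:
            continue  # assumption: zero entries contribute nothing
        score += ratio if ratio >= 1.0 else 1.0 / ratio
    return score


# Only the rows visible in the second listing (confidence 20 to 30);
# the elided rows are not included, so this is a partial sum.
visible = [2.78, 4.42, 4.01, 4.27, 4.23, 3.33, 4.96, 6.05, 7.13, 4.76, 8.86]
print(f"{problem_score(visible):.2f}")  # 54.80
```

Taking the reciprocal of values below 1.0 means that confidence values with fewer mismatches than expected also add to the score, so any departure from the expected frequency counts against the assembly.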


Last generated on 25 November 2011.