Overview

This document provides a quality assessment of Genome Analyzer results. The assessment is meant to complement, rather than replace, quality assessment available from the Genome Analyzer and its documentation. The narrative interpretation is based on the experience of the package maintainer. It is applicable to results from the 'Genome Analyzer' hardware single-end module, configured to scan 300 tiles per lane. The 'control' results referred to below are from analysis of PhiX-174 sequence provided by Illumina.

Run summary

Read counts. Filtered and aligned read counts are reported relative to the total number of reads (clusters; if only filtered or aligned reads are available, total read count is reported). Consult Genome Analyzer documentation for official guidelines. From experience, very good runs of the Genome Analyzer 'control' lane result in 25-30 million reads, with up to 95% passing pre-defined filters.

  ShortRead:::.plotReadCount(qa)
./image/readCount.jpg
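The proportions described above can be reproduced by hand. The following is a minimal sketch, assuming a data frame shaped like `qa[["readCounts"]]` with columns `read`, `filter` and `aligned`; the column names, lane names and counts are illustrative assumptions, not values from this run.

```r
# Hypothetical per-lane counts; in a real report these would come from
# qa[["readCounts"]] (column names assumed here).
readCounts <- data.frame(
    read    = c(28000000, 26500000),
    filter  = c(26000000, 21200000),
    aligned = c(24000000, 19000000),
    row.names = c("lane1", "lane2")
)

# Express filtered and aligned counts relative to the total read count.
proportions <- readCounts[, c("filter", "aligned")] / readCounts$read
print(round(proportions, 3))
```

A lane whose `filter` proportion falls well below the roughly 95% seen in very good 'control' lanes is a candidate for closer inspection.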

Base call frequency over all reads. Base frequencies should accurately reflect the frequencies of the regions sequenced.

  ShortRead:::.plotNucleotideCount(qa)
./image/baseCalls.jpg
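One way to check base frequencies against expectation is to convert counts to proportions and compare them with the known composition of the sequenced region. This sketch uses made-up counts and a placeholder `expected` vector; the one-row-per-lane layout is an assumption about the data rather than a statement of how the report stores it.

```r
# Hypothetical per-lane base call counts (layout assumed: one row per lane,
# one column per base).
baseCalls <- rbind(
    lane1 = c(A = 300, C = 200, G = 200, T = 290, N = 10),
    lane2 = c(A = 250, C = 250, G = 250, T = 250, N = 0)
)

# Convert counts to within-lane proportions.
baseFreq <- prop.table(baseCalls, margin = 1)

# Expected frequencies for the sequenced region (placeholder values).
expected <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25, N = 0)

# Largest absolute deviation from expectation, per lane.
maxDeviation <- apply(baseFreq, 1, function(p) max(abs(p - expected[names(p)])))
print(maxDeviation)
```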

Overall read quality. Lanes with consistently good quality reads have strong peaks at the right of the panel.

  df <- qa[["readQualityScore"]]
  ShortRead:::.plotReadQuality(df[df$type=="read",])
./image/readQuality.jpg

Read distribution

These curves show how coverage is distributed amongst reads. Ideally, the cumulative proportion of reads will transition sharply from low to high.

Portions to the left of the transition correspond to reads that are represented relatively infrequently, and might roughly reflect sequencing or sample processing errors. 10-15% of reads in a typical Genome Analyzer 'control' lane fall in this category.

Portions to the right of the transition represent reads that are over-represented compared to expectation. These might include inadvertently sequenced primer or adapter sequences, sequencing or base calling artifacts (e.g., poly-A reads), or features of the sample DNA (highly repeated regions) not adequately removed during sample preparation. About 5% of Genome Analyzer 'control' lane reads fall in this category.

Broad transitions from low to high cumulative proportion of reads may reflect sequencing bias or (perhaps intentional) features of sample preparation resulting in non-uniform coverage. The transition is about 5 times as wide as expected from uniform sampling across the Genome Analyzer 'control' lane.

  df <- qa[["sequenceDistribution"]]
  ShortRead:::.plotReadOccurrences(df[df$type=="read",], cex=.5)
./image/readOccurences.jpg
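The cumulative curve can be sketched from an occurrence table: each row says how many distinct sequences (`nReads`) were seen a given number of times (`nOccurrences`). The column names and values below are assumptions for illustration, and the calculation is one plausible reading of the plot, not necessarily the exact one used by the plotting function.

```r
# Hypothetical occurrence table for one lane: nReads distinct sequences
# were each seen nOccurrences times (column names assumed).
seqDist <- data.frame(
    nOccurrences = c(1, 2, 5, 100),
    nReads       = c(100, 400, 50, 1)
)

# Total reads contributed by each occurrence class, then the running share.
seqDist <- seqDist[order(seqDist$nOccurrences), ]
contributed <- seqDist$nOccurrences * seqDist$nReads
seqDist$cumulative <- cumsum(contributed) / sum(contributed)
print(seqDist)
```

In this toy lane, 8% of reads are singletons (left of the transition) and the final 8% come from a single sequence seen 100 times (right of the transition).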

Common duplicate reads might provide clues to the source of over-represented sequences. Some of these reads are filtered by the alignment algorithms; other duplicate reads might point to sample preparation issues.

  ShortRead:::.freqSequences(qa, "read")
     sequence                                       count lane
751  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 149020 719
51   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  89735 709
701  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  36828 718
1    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  30752 708
351  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN          26737 714_2
801  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  25165 720
1351 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN          13528 727_2
201  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  11224 712
1051 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN          10125 724_2
851  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   9769 721
151  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   8963 711
1151 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN           8570 725_2
1251 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN           8481 726_2
951  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANA        7450 723
651  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN           7182 717_2
551  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN           6481 716_2
451  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN           6218 715_2
852  GTCCTTTCGTACTAAAATATCACAATTTTTTAAAGATAGAAACCA   4897 721
901  GTCCTTTCGTACTAAAATATCACAATTTTTTAAAGATAGAAACCA   4401 722
952  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACNA        4264 723
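A listing of this kind can be approximated in base R by counting identical sequences and sorting by frequency. This is a minimal sketch on made-up reads; it is not the package's internal implementation.

```r
# Toy reads standing in for a lane of sequences (hypothetical data).
reads <- c("AAAAAAAA", "ACGTACGT", "AAAAAAAA", "NNNNNNNN",
           "AAAAAAAA", "NNNNNNNN", "GTCCTTTC")

# Count identical sequences and list the most frequent first.
counts <- sort(table(reads), decreasing = TRUE)
top <- data.frame(sequence = names(counts), count = as.integer(counts))
print(head(top, 3))
```

Poly-A and all-N sequences dominating the top of such a list, as in the report above, point to base calling artifacts rather than features of the sample.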

Common duplicate reads after filtering

  ShortRead:::.freqSequences(qa, "filtered")
NA

Common aligned duplicate reads

  ShortRead:::.freqSequences(qa, "aligned")
NA

Cycle-specific base calls and read quality

Per-cycle base calls should usually be approximately uniform across cycles. Genome Analyzer 'control' lane results often show a decline in A and an increase in T as cycles progress. This is likely an artifact of the underlying technology.

  perCycle <- qa[["perCycle"]]
  ShortRead:::.plotCycleBaseCall(perCycle$baseCall)
./image/perCycleBaseCall.jpg
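The drift in A and T described above can be quantified by converting per-cycle counts to proportions and comparing the first and last cycles. The sketch below uses invented tallies; the column names `Cycle`, `Base` and `Count` are assumptions for illustration.

```r
# Hypothetical per-cycle base call tallies (column names assumed).
baseCall <- data.frame(
    Cycle = rep(1:3, each = 4),
    Base  = rep(c("A", "C", "G", "T"), times = 3),
    Count = c(30, 20, 20, 30,
              28, 20, 20, 32,
              25, 20, 20, 35)
)

# Cross-tabulate counts by cycle and base, then convert to per-cycle proportions.
counts <- xtabs(Count ~ Cycle + Base, data = baseCall)
props <- prop.table(counts, margin = 1)
print(round(props, 2))

# Change in each base's proportion between the first and last cycle.
drift <- props[nrow(props), ] - props[1, ]
print(drift)
```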

Per-cycle quality score. Reported quality scores are 'calibrated', i.e., incorporating phred-like adjustments following sequence alignment. These typically decline with cycle, in an accelerating manner. Abrupt transitions in quality between cycles toward the end of the read might result when only some of the cycles are used for alignment: the cycles included in the alignment are calibrated more effectively than the cycles excluded from the alignment.

The reddish lines are quartiles (solid: median; dotted: 25th and 75th percentiles); the green line is the mean. Shading is proportional to the number of reads.

  perCycle <- qa[["perCycle"]]
  ShortRead:::.plotCycleQuality(perCycle$quality)
./image/perCycleQuality.jpg
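The mean and quartile lines can be derived from per-cycle tallies of how many reads had each quality score. This is a sketch under assumed column names (`Cycle`, `Score`, `Count`) with made-up counts; `weightedQuantile` is a hypothetical helper defined here, using a simple "first score reaching the cumulative share" rule.

```r
# Hypothetical per-cycle quality tallies for one lane (column names assumed):
# Count reads had quality Score at the given Cycle.
quality <- data.frame(
    Cycle = c(1, 1, 1, 2, 2, 2),
    Score = c(20, 30, 40, 10, 20, 30),
    Count = c(10, 30, 60, 40, 40, 20)
)

# Smallest score at which the cumulative share of reads reaches probability p.
weightedQuantile <- function(score, count, p) {
    o <- order(score)
    cumShare <- cumsum(count[o]) / sum(count)
    score[o][which(cumShare >= p)[1]]
}

# Summarise each cycle by its weighted mean and quartiles.
summaryByCycle <- do.call(rbind, lapply(split(quality, quality$Cycle), function(d) {
    data.frame(
        Cycle  = d$Cycle[1],
        mean   = sum(d$Score * d$Count) / sum(d$Count),
        q25    = weightedQuantile(d$Score, d$Count, 0.25),
        median = weightedQuantile(d$Score, d$Count, 0.50),
        q75    = weightedQuantile(d$Score, d$Count, 0.75)
    )
}))
print(summaryByCycle)
```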

Adapter contamination

Adapter contamination is defined here as non-genetic sequences attached at either or both ends of the reads. The 'contamination' measure is the number of reads with a right or left match to the adapter sequence over the total number of reads. Mismatch rates are 10% on the left and 20% on the right with a minimum overlap of 10 nt.

  ShortRead:::.ppnCount(qa[["adapterContamination"]])
Not available.
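To illustrate the measure, the sketch below flags a read as contaminated when its right end overlaps the start of the adapter, or its left end overlaps the end of the adapter, by at least 10 nt within the stated mismatch rates. It is a simplified base R illustration, not the package's implementation; the helper names, the adapter sequence and the reads are all placeholders.

```r
# Count mismatching positions between two equal-length strings.
nMismatch <- function(a, b) {
    sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# TRUE if the last k bases of the read match the first k bases of the
# adapter, for some k >= minOverlap, within the allowed mismatch rate.
hasRightMatch <- function(read, adapter, rate = 0.2, minOverlap = 10) {
    maxK <- min(nchar(read), nchar(adapter))
    if (maxK < minOverlap) return(FALSE)
    for (k in minOverlap:maxK) {
        readEnd <- substring(read, nchar(read) - k + 1, nchar(read))
        adapterStart <- substring(adapter, 1, k)
        if (nMismatch(readEnd, adapterStart) <= floor(rate * k)) return(TRUE)
    }
    FALSE
}

# TRUE if the first k bases of the read match the last k bases of the
# adapter; implemented by reversing both strings and reusing hasRightMatch.
hasLeftMatch <- function(read, adapter, rate = 0.1, minOverlap = 10) {
    rev1 <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")
    hasRightMatch(rev1(read), rev1(adapter), rate, minOverlap)
}

adapter <- "AGATCGGAAGAGCACACGTC"   # placeholder adapter sequence
reads <- c(
    "TTGACCATGCAGATCGGAAGAG",   # ends with the first 12 adapter bases
    "TTTTTTTTTTTTTTTTTTTTTT",   # no adapter-like sequence
    "GAGCACACGTCTTGACCATGCA"    # starts with the last 11 adapter bases
)

# Contamination: reads with a left or right match over the total number of reads.
contaminated <- vapply(reads, function(r)
    hasRightMatch(r, adapter) || hasLeftMatch(r, adapter), logical(1))
contamination <- mean(contaminated)
print(contamination)
```

This sketch ignores adapters embedded entirely within a read; a production analysis would more likely rely on a dedicated pattern-trimming routine.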

Tue Jun 14 05:27:43 2011; ShortRead v. 1.11.17
Report template: Martin Morgan