Handling Large Datasets
NGS WTAC 2011
David Jackson
WTSI
NGS => Large Datasets
If you're dealing with NGS data, you're dealing with "large" datasets.
- HiSeq run (208 cycles, 290 GBases)
runfolder
- 5.5TB off instrument
- 11TB with offline analysis
- 350GB (BAMs) to be kept
- MiSeq run (308 cycles, 2.1 GBases)
runfolder
- 30GB off instrument
- 50GB with offline analysis
- 1.6GB (BAMs) to be kept
Here at the Sanger
- 3 454s, 1 PacBio, 1 IonTorrent, 1 MiSeq, several capillary machines
- 1 Illumina MiSeq, 2 GAs and 23 (and rising?) HiSeqs
To deal with the data from these machines
- Automation
- Computing infrastructure
Sanger Sequencing Informatics Infrastructure
Compute
- 2 racks * 16 blades
- 7 racks * ~14 blades
With a batch queuing system and custom software to push tailored analyses onto it....
Sanger Sequencing Informatics Infrastructure
Storage
27 * storage servers
- Samba for instrument connection
- NFS for compute farm connection
- 60TB
And the network. And custom monitoring systems....
Sanger Sequencing Informatics
Current Limiting Factors
- Often RAM per CPU
- Even more often IO - read and write to storage
Avoiding IO Limits for Big Data
Reading and writing data from and to storage/disk (the IO) may slow down your analyses
- network drive
- shared with other processes/analyses and other users
- a slow drive (like your external USB)
- combinations of the above
Remedies?
- High performance filesystems e.g. Lustre
- Avoid the IO - use pipes (where possible)
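As a minimal sketch of the "use pipes" remedy: the Python helper below chains commands the way `a | b | c` does in the shell, so data flows from one process to the next without an intermediate file being written to storage. The function name `run_pipeline` and its error handling are assumptions made for illustration, not part of any pipeline described here.

```python
import subprocess


def run_pipeline(commands):
    """Chain commands with OS pipes, like `a | b | c` in the shell.

    `commands` is a list of argument lists. Returns the last command's
    stdout as bytes. No intermediate files are written to storage.
    """
    procs = []
    upstream = None
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            # Close our copy of the pipe so the upstream process is
            # signalled if the downstream one exits early.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)

    # Read the final stage's output, then reap the earlier stages.
    output, _ = procs[-1].communicate()
    for proc in procs[:-1]:
        proc.wait()

    # Report a failure in any stage (similar to `set -o pipefail`).
    failed = [p.args for p in procs if p.returncode != 0]
    if failed:
        raise RuntimeError(f"pipeline stage(s) failed: {failed}")
    return output
```

For example, `run_pipeline([["gzip", "-dc", "reads.fastq.gz"], ["wc", "-l"]])` would count the lines of a compressed file without ever writing the uncompressed copy to disk (`reads.fastq.gz` is a hypothetical file name).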
Big files and Lots of them...
It's not just that your data files are large - there are likely to be lots of them....
Plexing:
- Default Illumina tagset has 12
- Custom tags are quite common - plexing up to 96 at large sequencing centres
- Illumina is introducing pairs of tags, so I'd anticipate up to 12*12 (even off a MiSeq)....
To deal with this, some automation is required.
Using the shell and creating scripts is perhaps the easiest way to do this.
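A minimal sketch of this kind of automation, written in Python rather than the shell: build one command per data file in a directory and then run them all, instead of typing each by hand. The function names, the `*.bam` pattern and the choice of `samtools flagstat` as the per-file tool are assumptions for illustration only.

```python
import subprocess
from pathlib import Path


def build_commands(run_dir, pattern="*.bam", tool=("samtools", "flagstat")):
    """Return one command (as an argument list) per matching file.

    Files are sorted so that the order is the same on every run.
    """
    files = sorted(Path(run_dir).glob(pattern))
    return [[*tool, str(path)] for path in files]


def run_all(commands, dry_run=True):
    """Print each command (dry run) or execute it, stopping on failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

With 96 or 12*12 tags per lane, `run_all(build_commands("some_run_dir"))` lists every command first, and passing `dry_run=False` then executes them (`some_run_dir` is a hypothetical path).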
Web Based Analysis Pipelines
There are some good web browser accessible tools available:
- PacBio, IonTorrent, MiSeq
- Galaxy -- open and not tied to a particular vendor
These are becoming available as cloud-based services.
Bear in mind
- they are (inevitably?) less flexible
- harder to fix when your interesting data break them
- often use available hardware resources in a simple, potentially wasteful, manner
- analysis pipelines have to be reimplemented when moving between such web-based platforms
Questions?