Handling Large Datasets
NGS WTAC 2014
David Jackson
WTSI
NGS => Large Datasets
If you're dealing with NGS data, you're dealing with "large" datasets.
- HiSeq current 8 lane run (208 cycles, 320 GBases)
runfolder
- 0.5TB (or 5.5TB with 2 * intensities) off instrument
- 2TB for temporary (or 11TB if basecalling and calibrating) with offline analysis
- 150GB (CRAMs) to be kept
- HiSeq current rapid 2 lane run (208 cycles, 70 GBases)
runfolder
- 0.1TB (or 1.2TB with 2 * intensities) off instrument
- 0.3TB for temporary (or 2TB if basecalling and calibrating) with offline analysis
- 30GB (CRAMs) to be kept
- MiSeq run (208 cycles, 2.1 GBases)
runfolder
- 21GB (or 50GB with cif) off instrument
- 35GB with offline analysis
- 1.6GB (BAMs) to be kept
Here at the Sanger
- 1 PacBio, 2 IonTorrent PGMs and 1 Ion Proton, several capillary machines
- 6 Illumina MiSeqs, 11 HiSeq 2500s (3 types of camera), 22 HiSeq 2000s
To deal with the data from these machines
- Automation
- Computing infrastructure
Sanger Sequencing Informatics Infrastructure
Compute
- 2 racks * 16 blades
- 7 racks * ~14 blades
With a batch queuing system and custom software to push tailored analyses onto it....
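As a minimal sketch of what "push tailored analyses onto it" looks like, a short script can generate one queue submission per lane. This assumes LSF's bsub (the batch system in use at Sanger); SUBMIT is set to echo for a dry run, and analyse_lane is a hypothetical per-lane wrapper, not a real tool:

```shell
# Dry run: SUBMIT is "echo bsub", so nothing is actually queued -
# we just record the submissions that would be made.
SUBMIT="echo bsub"
for lane in $(seq 1 8); do
  # Request 4 cores and 8GB per lane job; analyse_lane is hypothetical
  $SUBMIT -n 4 -R "rusage[mem=8000]" "analyse_lane --lane $lane"
done > submitted.txt

cat submitted.txt
```

Dropping the echo turns the dry run into real submissions; the batch system then schedules the eight lane jobs across the farm for you.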
Sanger Sequencing Informatics Infrastructure
Storage

27 * storage servers
- Samba for instrument connection
- NFS for compute farm connection
- 60TB
And the network. And custom monitoring systems....
Sanger Sequencing Informatics
Current Limiting Factors
- Previously: RAM per CPU
- Now, even more often: IO - reads from and writes to storage
Avoiding IO Limits for Big Data
Reading data from, and writing data to, storage/disk (the IO) may slow down your analyses:
- a network drive
- storage shared with other processes/analyses and other users
- a slow drive (like an external USB disk)
- combinations of the above
Remedies?
- High performance filesystems e.g. Lustre
- Avoid the IO - use pipes (where possible)
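The pipe remedy can be demonstrated with everyday tools. With real data the shape is the same - something like `bwa mem ref.fa reads.fq | samtools view -b - > out.bam` (assuming bwa and samtools are installed) avoids ever writing the intermediate SAM to disk. A toy, self-contained version:

```shell
# A stand-in dataset: three "reads", one per line.
printf 'read1\nread2\nread3\n' > reads.txt

# Without pipes: each step writes its output to disk for the next to read.
gzip -c reads.txt > reads.txt.gz
gunzip -c reads.txt.gz > reads.tmp
wc -l < reads.tmp

# With pipes: the same steps, but the intermediate data streams
# between processes in memory and never touches the disk.
n=$(gzip -c reads.txt | gunzip -c | wc -l)
echo "$n"
```

Both print 3, but the piped version does a third of the disk IO - the difference grows with the 100GB+ files above.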
Big files and Lots of them...
It's not just that your data files are large - there are likely to be lots of them....
Plexing:
- The default Illumina tagset has 12 tags
- Custom tags are quite common - plexing up to 96 at large sequencing centres
- Illumina is introducing pairs of tags, so I'd anticipate up to 12*12 (even off a MiSeq)....
To deal with this, some automation is required.
Using the shell, creating scripts, is perhaps the easiest way to do this.
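As a sketch of that shell automation (the file names and directory are made up), one loop applies the same step to every plex rather than typing a command per file:

```shell
# Make a handful of dummy per-tag FASTQ files, standing in for
# the demultiplexed output of a 12-plex run.
mkdir -p plexes
for tag in $(seq 1 12); do
  printf '@r1\nACGT\n+\nIIII\n' > "plexes/sample_tag${tag}.fastq"
done

# One loop runs the same analysis step over every file; here we just
# count records (4 lines per FASTQ record) in place of a real aligner call.
for fq in plexes/*.fastq; do
  recs=$(( $(wc -l < "$fq") / 4 ))
  echo "$fq: $recs records"
done > report.txt
```

Swapping the record count for your real command (an alignment, a QC step) turns the same loop into a 96-plex pipeline with no extra typing.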
Web Based Analysis Pipelines
There are some good web browser accessible tools available:
- Vendor-supplied: PacBio, IonTorrent, MiSeq
- Galaxy -- open and not tied to a particular vendor
These are becoming available as cloud based services.
Bear in mind:
- they are (inevitably?) less flexible
- harder to fix when your interesting data break them
- often use available hardware resources in a simple, potentially wasteful, manner
- analysis pipelines have to be reimplemented when moving between such web based platforms
Questions?