Handling Large Datasets

NGS WTAC 2011

David Jackson

WTSI

NGS => Large Datasets

If you're dealing with NGS data, you're dealing with "large" datasets.

HiSeq run (208 cycles, 290 GBases)
- 5.5TB off instrument
- 11TB with offline analysis
- 350GB (BAMs) to be kept
MiSeq run (308 cycles, 2.1 GBases)
- 30GB off instrument
- 50GB with offline analysis
- 1.6GB (BAMs) to be kept
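
As a rough worked example of the figures above, the sketch below computes how much a HiSeq run grows during analysis, what fraction is finally kept, and how many archived bytes each sequenced base costs. Decimal units (1 TB = 1000 GB) are an assumption; the slide does not say which convention it uses.

```python
# Figures copied from the slide. Decimal units (1 TB = 1000 GB) are an assumption.
HISEQ_GBASES = 290
HISEQ_OFF_INSTRUMENT_GB = 5500
HISEQ_WITH_ANALYSIS_GB = 11000
HISEQ_KEPT_BAM_GB = 350

MISEQ_GBASES = 2.1
MISEQ_KEPT_BAM_GB = 1.6


def bytes_per_base(kept_gb, gbases):
    """Archived bytes per sequenced base (GB / Gbases reduces to bytes / base)."""
    return kept_gb / gbases


# How much the data expands between the instrument and the end of analysis.
hiseq_growth = HISEQ_WITH_ANALYSIS_GB / HISEQ_OFF_INSTRUMENT_GB

# Fraction of the peak working set that is kept long term as BAMs.
hiseq_kept_fraction = HISEQ_KEPT_BAM_GB / HISEQ_WITH_ANALYSIS_GB

hiseq_bpb = bytes_per_base(HISEQ_KEPT_BAM_GB, HISEQ_GBASES)
miseq_bpb = bytes_per_base(MISEQ_KEPT_BAM_GB, MISEQ_GBASES)

print(f"HiSeq growth during analysis: {hiseq_growth:.1f}x")
print(f"HiSeq fraction kept:          {hiseq_kept_fraction:.1%}")
print(f"HiSeq bytes per base kept:    {hiseq_bpb:.2f}")
print(f"MiSeq bytes per base kept:    {miseq_bpb:.2f}")
```

The point of the arithmetic: the scratch space needed during analysis is far larger than what is archived, so working storage, not the archive, tends to be what you run out of first.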

Here at the Sanger

3 454s, 1 PacBio, 1 Ion Torrent, several capillary machines
1 Illumina MiSeq, 2 GAs and 23 (and rising?) HiSeqs

To deal with the data from these machines

Automation
Computing infrastructure

Sanger Sequencing Informatics Infrastructure

Compute

2 racks * 16 blades
- 12 CPU
- 36 GB RAM
7 racks * ~14 blades
- 8 CPU
- 12 GB RAM

With a batch queuing system and custom software to push tailored analyses onto it....
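
To put the blade counts above into totals, here is a minimal sketch. It assumes the CPU and RAM figures are per blade, and it treats "~14 blades" as exactly 14; both are assumptions, not statements from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BladeGroup:
    """One group of identical blades. Per-blade figures are an assumption."""
    racks: int
    blades_per_rack: int
    cpus_per_blade: int
    ram_gb_per_blade: int

    @property
    def blades(self):
        return self.racks * self.blades_per_rack

    @property
    def cpus(self):
        return self.blades * self.cpus_per_blade

    @property
    def ram_gb(self):
        return self.blades * self.ram_gb_per_blade

    @property
    def ram_gb_per_cpu(self):
        return self.ram_gb_per_blade / self.cpus_per_blade


GROUPS = [
    BladeGroup(racks=2, blades_per_rack=16, cpus_per_blade=12, ram_gb_per_blade=36),
    # "~14 blades" on the slide is taken as exactly 14 here.
    BladeGroup(racks=7, blades_per_rack=14, cpus_per_blade=8, ram_gb_per_blade=12),
]

total_cpus = sum(g.cpus for g in GROUPS)
total_ram_gb = sum(g.ram_gb for g in GROUPS)

for g in GROUPS:
    print(f"{g.blades} blades: {g.cpus} CPUs, {g.ram_gb} GB RAM, "
          f"{g.ram_gb_per_cpu:.1f} GB per CPU")
print(f"Total: {total_cpus} CPUs, {total_ram_gb} GB RAM")
```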

Sanger Sequencing Informatics Infrastructure

Storage

27 * storage servers

Samba for instrument connection
NFS for compute farm connection
60TB

And the network. And custom monitoring systems....

Sanger Sequencing Informatics

Current Limiting Factors

Often RAM per CPU
Even more often IO - read and write to storage
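
A small sketch of why RAM per CPU becomes the limit: the number of jobs a blade can run at once is the smaller of what its CPUs allow and what its memory allows. The blade sizes are the ones from the Compute slide; the job needing 1 CPU and 4 GB is a hypothetical example.

```python
def concurrent_jobs(cpus, ram_gb, job_cpus, job_ram_gb):
    """Jobs that fit on one blade: the tighter of the CPU and RAM limits wins."""
    by_cpu = cpus // job_cpus
    by_ram = int(ram_gb // job_ram_gb)
    return min(by_cpu, by_ram)


# Hypothetical job: single-threaded, 4 GB of memory.
JOB_CPUS = 1
JOB_RAM_GB = 4

small_blade = concurrent_jobs(cpus=8, ram_gb=12, job_cpus=JOB_CPUS, job_ram_gb=JOB_RAM_GB)
large_blade = concurrent_jobs(cpus=12, ram_gb=36, job_cpus=JOB_CPUS, job_ram_gb=JOB_RAM_GB)

print(f"8 CPU / 12 GB blade runs {small_blade} jobs, leaving {8 - small_blade} CPUs idle")
print(f"12 CPU / 36 GB blade runs {large_blade} jobs, leaving {12 - large_blade} CPUs idle")
```

In both cases memory runs out before the CPUs do, so some CPUs sit idle.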

Avoiding IO Limits for Big Data

Reading and writing data from and to storage/disk (the IO) may slow down your analyses, particularly when that storage is:

network drive
shared with other processes/analyses and other users
a slow drive (like your external USB)
combinations of the above

Remedies?

High performance filesystems e.g. Lustre
Avoid the IO - use pipes (where possible)
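
A minimal sketch of the pipe idea: the output of one step is streamed straight into the next, so the intermediate data never touches the disk. The two commands are stand-ins written as Python one-liners so the example runs anywhere; in a real analysis they would be your own tools (for instance an aligner feeding a sorter).

```python
import subprocess
import sys


def pipe(producer_cmd, consumer_cmd):
    """Run `producer | consumer` with no intermediate file; return consumer stdout."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_cmd, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close our copy so the producer is signalled if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    if producer.returncode != 0 or consumer.returncode != 0:
        raise RuntimeError("a step in the pipeline failed")
    return output


# Stand-in steps (hypothetical): one emits 1000 lines, the other counts them.
PRODUCER = [sys.executable, "-c", "for i in range(1000): print('read%d' % i)"]
CONSUMER = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"]

line_count = int(pipe(PRODUCER, CONSUMER).decode().strip())
print(f"lines passed through the pipe: {line_count}")
```

The trade-off is that nothing is left on disk to restart from, so pipes suit steps that are cheap to rerun.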

Big files and Lots of them...

It's not just that your data files are large - there are likely to be lots of them....

Plexing:

Default Illumina tagset has 12 tags
Custom tags are quite common - plexing up to 96 at large sequencing centres
Illumina is introducing pairs of tags, so I'd anticipate up to 12*12 (even off a MiSeq)....

To deal with this some automation is required for

your sanity
consistency

Using the shell, creating scripts, is perhaps the easiest way to do this.
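
The same idea can be sketched in any scripting language. The example below generates one consistently named output file per lane and tag pair instead of typing names by hand. The 8 lanes per flowcell, the run name and the file naming scheme are all assumptions made for illustration.

```python
from itertools import product

TAGS_PER_READ = 12    # default Illumina tagset size, from the slide
LANES = range(1, 9)   # assumption: 8 lanes per flowcell


def plex_outputs(run_id, lanes, tags_per_read):
    """Return one BAM file name per (lane, first tag, second tag) combination."""
    tags = range(1, tags_per_read + 1)
    return [
        # Hypothetical naming scheme; zero-padding keeps the names sortable.
        f"{run_id}_lane{lane}_tag{tag1:02d}-{tag2:02d}.bam"
        for lane, tag1, tag2 in product(lanes, tags, tags)
    ]


outputs = plex_outputs("run6000", LANES, TAGS_PER_READ)

print(f"tag pairs per lane: {TAGS_PER_READ * TAGS_PER_READ}")
print(f"files for one run:  {len(outputs)}")
print(f"first: {outputs[0]}")
print(f"last:  {outputs[-1]}")
```

With paired tags a single run yields over a thousand files under these assumptions, which is why generating names and commands from a script is both saner and more consistent than doing it by hand.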

Web Based Analysis Pipelines

There are some good web browser accessible tools available:

PacBio, IonTorrent, MiSeq
Galaxy -- open and not tied to a particular vendor

These are becoming available as cloud-based services.

Bear in mind

they are (inevitably?) less flexible
harder to fix when your interesting data break them
often use available hardware resources in a simple, potentially wasteful, manner
analysis pipelines have to be reimplemented when moving between such web based platforms

Questions?

WTSI -- 8/10/2011

NGS WTAC 2011 -- Handling Large Datasets
