Handling Large Datasets
NGS WTAC 2014
David Jackson
WTSI
NGS => Large Datasets
If you're dealing with NGS data, you're dealing with "large" datasets.
- HiSeq current 8 lane run (208 cycles, 320 GBases)
runfolder
- 0.5TB (or 5.5TB with 2 * intensities) off instrument
- 2TB for temporary (or 11TB if basecalling and calibrating) with offline analysis
- 150GB (CRAMs) to be kept
- HiSeq current rapid 2 lane run (208 cycles, 70 GBases)
runfolder
- 0.1TB (or 1.2TB with 2 * intensities) off instrument
- 0.3TB for temporary (or 2TB if basecalling and calibrating) with offline analysis
- 30GB (CRAMs) to be kept
- MiSeq run (208 cycles, 2.1 GBases)
runfolder
- 21GB (or 50GB with cif) off instrument
- 35GB with offline analysis
- 1.6GB (BAMs) to be kept
Here at the Sanger
- 1 PacBio, 2 IonTorrent PGMs and 1 Ion Proton, several capillary machines
- 6 Illumina MiSeqs, 11 HiSeq 2500s (3 types of camera), 22 HiSeq 2000s
To deal with the data from these machines
- Automation
- Computing infrastructure
Sanger Sequencing Informatics Infrastructure
Compute
- 2 racks * 16 blades
- 7 racks * ~14 blades
With a batch queuing system and custom software to push tailored analyses onto it....
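As a minimal sketch of what "push tailored analyses onto it" looks like, a short script can generate one queue submission per lane. This assumes LSF's bsub (the batch system in use at Sanger); SUBMIT is set to echo for a dry run, and analyse_lane is a hypothetical per-lane wrapper, not a real tool:

```shell
# Dry run: SUBMIT is "echo bsub", so nothing is actually queued -
# we just record the submissions that would be made.
SUBMIT="echo bsub"
for lane in $(seq 1 8); do
  # Request 4 cores and 8GB per lane job; analyse_lane is hypothetical
  $SUBMIT -n 4 -R "rusage[mem=8000]" "analyse_lane --lane $lane"
done > submitted.txt

cat submitted.txt
```

Dropping the echo turns the dry run into real submissions; the batch system then schedules the eight lane jobs across the farm for you.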
Sanger Sequencing Informatics Infrastructure
Storage

27 * storage servers
- Samba for instrument connection
- NFS for compute farm connection
- 60TB
And the network. And custom monitoring systems....
Sanger Sequencing Informatics
Current Limiting Factors
- Previously: RAM per CPU
- Now, even more often: IO - reads from and writes to storage
Avoiding IO Limits for Big Data
Reading data from, and writing data to, storage/disk (the IO) may slow down your analyses:
- a network drive
- storage shared with other processes/analyses and other users
- a slow drive (like an external USB disk)
- combinations of the above
Remedies?
- High performance filesystems e.g. Lustre
- Avoid the IO - use pipes (where possible)
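The pipe remedy can be demonstrated with everyday tools. With real data the shape is the same - something like `bwa mem ref.fa reads.fq | samtools view -b - > out.bam` (assuming bwa and samtools are installed) avoids ever writing the intermediate SAM to disk. A toy, self-contained version:

```shell
# A stand-in dataset: three "reads", one per line.
printf 'read1\nread2\nread3\n' > reads.txt

# Without pipes: each step writes its output to disk for the next to read.
gzip -c reads.txt > reads.txt.gz
gunzip -c reads.txt.gz > reads.tmp
wc -l < reads.tmp

# With pipes: the same steps, but the intermediate data streams
# between processes in memory and never touches the disk.
n=$(gzip -c reads.txt | gunzip -c | wc -l)
echo "$n"
```

Both print 3, but the piped version does a third of the disk IO - the difference grows with the 100GB+ files above.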
Big files and Lots of them...
It's not just that your data files are large - there are likely to be lots of them....
Plexing:
- The default Illumina tagset has 12 tags
- Custom tags are quite common - plexing up to 96 at large sequencing centres
- Illumina is introducing pairs of tags, so I'd anticipate up to 12*12 (even off a MiSeq)....
To deal with this, some automation is required.
Using the shell, creating scripts, is perhaps the easiest way to do this.
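As a sketch of that shell automation (the file names and directory are made up), one loop applies the same step to every plex rather than typing a command per file:

```shell
# Make a handful of dummy per-tag FASTQ files, standing in for
# the demultiplexed output of a 12-plex run.
mkdir -p plexes
for tag in $(seq 1 12); do
  printf '@r1\nACGT\n+\nIIII\n' > "plexes/sample_tag${tag}.fastq"
done

# One loop runs the same analysis step over every file; here we just
# count records (4 lines per FASTQ record) in place of a real aligner call.
for fq in plexes/*.fastq; do
  recs=$(( $(wc -l < "$fq") / 4 ))
  echo "$fq: $recs records"
done > report.txt
```

Swapping the record count for your real command (an alignment, a QC step) turns the same loop into a 96-plex pipeline with no extra typing.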
Web Based Analysis Pipelines
There are some good web browser accessible tools available:
- Vendor-supplied: PacBio, IonTorrent, MiSeq
- Galaxy -- open and not tied to a particular vendor
These are becoming available as cloud based services.
Bear in mind:
- they are (inevitably?) less flexible
- harder to fix when your interesting data break them
- often use available hardware resources in a simple, potentially wasteful, manner
- analysis pipelines have to be reimplemented when moving between such web based platforms
Questions?