Handling Large Datasets

NGS WTAC 2011

David Jackson

WTSI

NGS => Large Datasets

If you're dealing with NGS data, you're dealing with "large" datasets.

Here at the Sanger

To deal with the data from these machines

Sanger Sequencing Informatics Infrastructure

Compute

With a batch queuing system and custom software to push tailored analyses onto it....

Sanger Sequencing Informatics Infrastructure

Storage

27 * storage servers

And the network.

And custom monitoring systems....

Sanger Sequencing Informatics

Current Limiting Factors

Avoiding IO Limits for Big Data

Reading and writing data from and to storage/disk (the IO) may slow down your analyses.

Remedies?
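One common way to cut down on IO, offered here as an illustrative assumption rather than as a remedy taken from the slide, is to keep data compressed on disk and stream it through a pipe, so that no uncompressed intermediate file is ever written. The function name `count_matches` and the file names in the usage line are hypothetical.

```shell
#!/bin/sh
# Sketch only: stream a compressed file through a pipe instead of
# unpacking it to disk first. The function name is hypothetical.

# count_matches FILE PATTERN
# Print the number of lines in gzip-compressed FILE that match PATTERN.
count_matches() {
    # gzip -dc decompresses to standard output; grep reads from the pipe,
    # so the uncompressed data never touches storage.
    gzip -dc "$1" | grep -c -e "$2"
}
```

For example, `count_matches reads.txt.gz '^ACG'` would count the matching lines while reading only the compressed file from disk.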

Big files and Lots of them...

It's not just that your data files are large - there are likely to be lots of them....

Plexing:

To deal with this, some automation is required for

Using the shell to create scripts is perhaps the easiest way to do this.
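As a minimal sketch of this kind of automation, the shell function below applies the same step to every file in a directory. The directory layout, the `.txt` suffix, and the function name `count_records` are assumptions made for illustration. Counting lines stands in for whatever per-file analysis is actually needed.

```shell
#!/bin/sh
# Sketch only: run the same step over every data file in a directory.
# The ".txt" suffix and the function name are hypothetical.

# count_records DIR
# Print "<file name><TAB><line count>" for each *.txt file in DIR.
count_records() {
    dir=$1
    for f in "$dir"/*.txt; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        # Reading via redirection keeps the file name out of wc's output;
        # tr strips the padding some wc implementations add.
        n=$(wc -l < "$f" | tr -d ' ')
        printf '%s\t%s\n' "$(basename "$f")" "$n"
    done
}
```

Calling `count_records /path/to/run` would then print one line per file. Replacing the `wc` line with a real analysis command gives the same loop structure for any per-file task.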

Web Based Analysis Pipelines

There are some good tools accessible from a web browser. These are becoming available as cloud-based services.

Bear in mind

Questions?