NGS File Formats

NGS WTAC 2012

David Jackson

WTSI

Fasta Sequence Data Files

and Text File Manipulation

Fastq Sequence Data Files

and Using the Shell

Illumina Runfolder Anatomy

SAM (& BAM) Sequence & Alignment Files

  • http://samtools.sourceforge.net/ for the specification, links to programs, and other resources
  • Samtools and Picard are the two leading suites for dealing with SAM files
  • Text (typically .sam files) and binary (typically .bam files) formats
  • wget ftp://ftp.sanger.ac.uk/pub/team117/WTAC/6823_1_phix.sam
  • head 6823_1_phix.sam
    header records start with @; their tab-delimited fields are identified by two-character tags followed by a colon
  • grep -v -E '^@' 6823_1_phix.sam | head
    each read of a pair (subread) has its own sequence/alignment record; fields are tab delimited (11 mandatory, followed by optional tag-identified fields)
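
The header/record split described above can be tried on a tiny hand-made file before looking at the real one. This is only a sketch: the read names, positions, sequences and tag values below are invented for illustration and are not taken from 6823_1_phix.sam.

```shell
# Build a tiny SAM file to experiment on (all values here are made up).
tmp_sam=$(mktemp)
{
  printf '@HD\tVN:1.4\tSO:unsorted\n'
  printf '@SQ\tSN:phix\tLN:5386\n'
  printf 'read1\t99\tphix\t1\t60\t10M\t=\t101\t110\tGAGTTTTATC\tIIIIIIIIII\tNM:i:0\n'
  printf 'read1\t147\tphix\t101\t60\t10M\t=\t1\t-110\tTTTTTTTTTT\tIIIIIIIIII\tNM:i:1\n'
} > "$tmp_sam"

# Header records are the lines starting with @.
header_count=$(grep -c '^@' "$tmp_sam")

# Everything else is a sequence/alignment record, one per read of the pair.
record_count=$(grep -vc '^@' "$tmp_sam")

# Count the tab-delimited fields of the first record:
# 11 mandatory fields plus one optional tag (NM:i:0) in this example.
field_count=$(grep -v '^@' "$tmp_sam" | head -n 1 | awk -F'\t' '{ print NF }')

echo "headers=$header_count records=$record_count fields=$field_count"
# prints: headers=2 records=2 fields=12

# Pull out a few mandatory columns: read name, FLAG and position.
grep -v '^@' "$tmp_sam" | cut -f 1,2,4
```

The same grep, cut and awk commands can then be pointed at 6823_1_phix.sam instead of the temporary file.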

Samtools

  • wget ftp://ftp.sanger.ac.uk/pub/team117/WTAC/samtools.zip && unzip samtools.zip
    download and, if the download succeeds, uncompress some samtools binaries (that Thomas Keane has prepared for these Ubuntu PCs)
  • samtools/samtools view -bS 6823_1_phix.sam  > 6823_1_phix.bam 
    convert SAM to BAM
  • file 6823_1_phix.bam 6823_1_phix.sam
    don't stare into the sun, don't look directly at binary files....
  • ls -lh 6823_1* 
    compare the sizes of the files....
  • samtools/samtools flagstat 6823_1_phix.bam
    for some useful stats (dependent on how the file was created)
  • samtools/samtools index 6823_1_phix.bam 
    creating an index allows fast access to the reads at a given reference location
  • samtools/samtools tview 6823_1_phix.bam phix-illumina.fa
    reference based text viewer -- boring for phiX
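
Many of the counts that flagstat reports (paired, properly paired, read1, read2 and so on) come from the FLAG field, the second mandatory column of each record. As a rough sketch of how that number breaks down into bits, the function below decodes the lowest eight flag bits; describe_flag and its output labels are names made up for this example, not part of samtools.

```shell
# Decode the low eight bits of a SAM FLAG value into readable labels.
# describe_flag and the label names are hypothetical, for illustration only.
describe_flag() {
  flag=$1
  names=""
  if [ $((flag & 1)) -ne 0 ]; then names="$names paired"; fi
  if [ $((flag & 2)) -ne 0 ]; then names="$names proper_pair"; fi
  if [ $((flag & 4)) -ne 0 ]; then names="$names unmapped"; fi
  if [ $((flag & 8)) -ne 0 ]; then names="$names mate_unmapped"; fi
  if [ $((flag & 16)) -ne 0 ]; then names="$names reverse"; fi
  if [ $((flag & 32)) -ne 0 ]; then names="$names mate_reverse"; fi
  if [ $((flag & 64)) -ne 0 ]; then names="$names first_in_pair"; fi
  if [ $((flag & 128)) -ne 0 ]; then names="$names second_in_pair"; fi
  # Strip the leading space left by the first appended label.
  echo "${names# }"
}

describe_flag 99    # prints: paired proper_pair mate_reverse first_in_pair
describe_flag 147   # prints: paired proper_pair reverse second_in_pair
describe_flag 4     # prints: unmapped
```

Flags 99 and 147 are a common combination for the two reads of a properly aligned pair, which is why they tend to appear together in paired-end SAM files.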

So...

  • Sequence file formats
    • fasta(.gz), fastq(.gz), sam/bam : you're most likely to work with
    • cram : smaller than bam, v1.0 now?
    • sff, srf, sra, ztr : there are plenty of other formats which typically contain more "raw" data

Questions?

Picard

  • Download and uncompress/install Java
  • Download and uncompress/install Picard
  • Try converting from BAM to fastq
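
To see what a conversion to fastq involves before trying it with Picard, here is a minimal awk sketch that works on SAM text. It is not Picard (whose tool for this job is SamToFastq) and it makes simplifying assumptions: single-end reads only, no /1 and /2 read-name suffixes, and no handling of secondary alignments. The function name sam_to_fastq and the two example records are invented for illustration.

```shell
# Hypothetical two-record SAM file: one forward read, one reverse-strand read.
sam=$(mktemp)
{
  printf '@SQ\tSN:phix\tLN:5386\n'
  printf 'r1\t0\tphix\t1\t60\t4M\t*\t0\t0\tACGT\tABCD\n'
  printf 'r2\t16\tphix\t9\t60\t4M\t*\t0\t0\tAACG\tEFGH\n'
} > "$sam"

# Convert SAM text to fastq. Reads aligned to the reverse strand (FLAG bit 16)
# are stored reverse-complemented in SAM, so they are flipped back here.
sam_to_fastq() {
  awk -F'\t' '
    function reverse(s,    i, out) {
      out = ""
      for (i = length(s); i >= 1; i--) out = out substr(s, i, 1)
      return out
    }
    function complement(s,    i, c, p, out) {
      out = ""
      for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        p = index("ACGTNacgtn", c)
        out = out (p ? substr("TGCANtgcan", p, 1) : c)
      }
      return out
    }
    /^@/ { next }                      # skip header records
    {
      seq = $10; qual = $11
      if (int($2 / 16) % 2 == 1) {     # FLAG bit 16 set: reverse strand
        seq = complement(reverse(seq))
        qual = reverse(qual)
      }
      printf "@%s\n%s\n+\n%s\n", $1, seq, qual
    }' "$1"
}

sam_to_fastq "$sam"
# prints:
# @r1
# ACGT
# +
# ABCD
# @r2
# CGTT
# +
# HGFE
```

A BAM file would first need to be turned back into text (for example with samtools view) before a sketch like this could read it; a dedicated converter also takes care of pairing the reads.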