Linux Intro and NGS File Formats

NGS WTAC 2016

Tony Cox and Steven Leonard

WTSI

Shell, Running Commands

N.b. easy and quite universal remote access

Navigation and Manipulation of Directories

External Drives

Fasta Sequence Data Files

and Text File Manipulation

Fastq Sequence Data Files

and Using the Shell

Pipes and Data Redirection

Process Control

SAM (& BAM) Sequence & Alignment Files

SAM format

Samtools

obtaining and building

Samtools

using

Samtools & CRAM

Illumina Runfolder Anatomy

So...

  • Sequence file formats
    • fasta, fastq, sam/bam : you're most likely to work with today
    • cram : smaller than bam, v3 now, next likely common format
    • sff, srf, sra, ztr : there are plenty of other formats which typically contain more "raw" data
    • Illumina: (c)locs, filter, control, bcl → fastq
  • Using the shell
    • is a "universal" language
    • gives you a history
    • can get the computer to do the (boring) repetitive stuff completely consistently
    • allows you to wrap an established procedure into a script
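
As a minimal sketch of wrapping a procedure in a script, the function below counts the records in a fastq file. It assumes the common layout of exactly four lines per record (no wrapped sequence lines). The function name count_fastq_reads and the file toy.fastq are made up for illustration.

```shell
#!/bin/sh
# count_fastq_reads FILE: print the number of records in a fastq file.
# Assumes exactly four lines per record (header, sequence, '+', qualities).
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Build a tiny two-read example file so the sketch runs on its own.
cat > toy.fastq <<'EOF'
@read1
ACGT
+
IIII
@read2
TTGA
+
IIII
EOF

count_fastq_reads toy.fastq
```

Once the procedure lives in a function or script, it can be rerun on any file in exactly the same way, and the shell history records which files it was run on.
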
Questions?

Samtools & flagstat, stats/bamcheck

  • samtools-1.3/samtools view -u \
    ftp://ngs.sanger.ac.uk/scratch/project/WTAC/processed_data/12585_1#21.bam \
    | samtools-1.3/samtools stats - | tee 12585_1#21.bam.bamstats
    pull your data from the ftp site, push it uncompressed through samtools stats (formerly bamcheck), and write the output both to the terminal and to a file
  • samtools-1.3/misc/plot-bamstats -p 12585_1#21/ 12585_1#21.bam.bamstats
    create some plots from the samtools stats (bamcheck) output
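
The stats report is plain tab-separated text in which the summary lines start with SN, so the headline numbers can be pulled out with grep and cut. The sketch below shows this on a toy report; the values, the file names toy.bamstats and toy.summary, and the function name summary_numbers are all made up for illustration.

```shell
# summary_numbers FILE: print the "SN" (summary numbers) section of a
# samtools stats report, dropping the leading SN tag.
summary_numbers() {
    grep '^SN' "$1" | cut -f 2-
}

# Toy report with made-up values, tab-separated as in real output.
printf 'SN\traw total sequences:\t1000\n'  > toy.bamstats
printf 'SN\treads mapped:\t990\n'         >> toy.bamstats
printf 'FFQ\t1\t0\t0\n'                   >> toy.bamstats

# tee writes the summary to the terminal and to a file at the same time.
summary_numbers toy.bamstats | tee toy.summary
```

On real data the same function could be pointed at 12585_1#21.bam.bamstats.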

Picard

  • From Broad Institute
  • Download and uncompress/install Picard
    • wget https://github.com/broadinstitute/picard/releases/download/1.141/picard-tools-1.141.zip
    • unzip picard-tools-1.141.zip
    • java -jar picard-tools-1.141/picard.jar -h
  • Try converting from BAM to fastq
    • java -jar picard-tools-1.141/picard.jar SamToFastq -h
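
A possible conversion run is sketched below, assuming a paired-end BAM and the Picard 1.141 jar unpacked as above. The helper names fastq_prefix and bam_to_fastq and the output naming scheme (_1.fastq, _2.fastq) are illustrative choices, not part of Picard. Only the name helper is executed here; the Java call sits inside a function so that the block runs without Picard installed.

```shell
# Path to the jar unpacked above; override PICARD_JAR if yours lives elsewhere.
PICARD_JAR=${PICARD_JAR:-picard-tools-1.141/picard.jar}

# fastq_prefix BAM: strip any directory and the .bam suffix.
fastq_prefix() {
    basename "$1" .bam
}

# bam_to_fastq BAM: write read 1 and read 2 to separate fastq files.
bam_to_fastq() {
    prefix=$(fastq_prefix "$1")
    java -jar "$PICARD_JAR" SamToFastq \
        INPUT="$1" \
        FASTQ="${prefix}_1.fastq" \
        SECOND_END_FASTQ="${prefix}_2.fastq"
}

# Example (not run here): bam_to_fastq '12585_1#21.bam'
fastq_prefix 'data/12585_1#21.bam'
```

Quoting the file name matters because it contains a # character.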

FastQC

  • From Babraham Institute - neighbours up the road...
  • Download and uncompress/install FastQC
    • wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
    • unzip fastqc_v0.11.5.zip
    • chmod 755 FastQC/fastqc
  • Try inspecting BAM or fastq files
    • FastQC/fastqc

samtools & pileup

  • Index the reference and view the pileup (using the samtools built earlier)
    • samtools-1.3/samtools faidx phix-illumina.fa
      create an index for reference fasta file
    • cat phix-illumina.fa.fai 
      it's quite boring for this reference
    • samtools-1.3/samtools mpileup -f phix-illumina.fa s6823_1_phix.bam | less -S
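
The mpileup text output has one line per position, with tab-separated columns for sequence name, position, reference base, depth, read bases and base qualities. As a rough sketch of working with it, the function below averages the depth column over the positions mpileup reported. The function name mean_depth and the toy lines (including the sequence name toy_ref) are made up for illustration.

```shell
# mean_depth: read mpileup text on stdin, print the mean of column 4 (depth)
# over the positions that mpileup reported.
mean_depth() {
    awk -F '\t' '{ sum += $4 } END { if (NR > 0) printf "%.1f\n", sum / NR }'
}

# Toy pileup lines: name, position, reference base, depth, bases, qualities.
printf 'toy_ref\t1\tG\t2\t..\tII\n'      > toy.pileup
printf 'toy_ref\t2\tA\t3\t.,.\tIII\n'   >> toy.pileup
printf 'toy_ref\t3\tG\t4\t.,.,\tIIII\n' >> toy.pileup

mean_depth < toy.pileup
# With real data (not run here):
#   samtools-1.3/samtools mpileup -f phix-illumina.fa s6823_1_phix.bam | mean_depth
```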

IGV

  • From Broad Institute
  • Download and uncompress/install igv
    • wget ftp://ngs.sanger.ac.uk/scratch/project/WTAC/software/IGV_2.3.69.zip
    • unzip IGV_2.3.69.zip
  • Inspect your BAM file with IGV
    • IGV_2.3.69/igv.sh

Biobambam

  • Download and uncompress/install Biobambam
    • wget https://github.com/gt1/biobambam2/releases/download/2.0.31-release-20160307150858/biobambam2-2.0.31-release-20160307150858-x86_64-etch-linux-gnu.tar.gz
    • tar xvf biobambam2-2.0.31-release-20160307150858-x86_64-etch-linux-gnu.tar.gz
    • ln -s biobambam2-2.0.31-release-20160307150858-x86_64-etch-linux-gnu/bin biobambam2
  • Try marking duplicates using Biobambam and compare with Picard
    • biobambam2/bammarkduplicates -h
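
One way to make the comparison is sketched below, assuming a coordinate-sorted input BAM and the install locations used in the earlier slides. The function names dup_count and mark_both and all output file names are made up for illustration. Only the flagstat parsing is executed here, on a toy excerpt with made-up counts; the tool calls sit inside a function that is not run.

```shell
# dup_count: read `samtools flagstat` output on stdin and print the number of
# QC-passed reads flagged as duplicates.
dup_count() {
    awk '$4 == "duplicates" { print $1 }'
}

# mark_both BAM: run both duplicate markers on the same input (not run here).
mark_both() {
    biobambam2/bammarkduplicates I="$1" O=bbb_marked.bam M=bbb_metrics.txt
    java -jar picard-tools-1.141/picard.jar MarkDuplicates \
        INPUT="$1" OUTPUT=picard_marked.bam METRICS_FILE=picard_metrics.txt
    for f in bbb_marked.bam picard_marked.bam; do
        printf '%s\t' "$f"
        samtools-1.3/samtools flagstat "$f" | dup_count
    done
}

# Toy flagstat excerpt with made-up counts, to show what dup_count extracts.
printf '1000 + 0 in total (QC-passed reads + QC-failed reads)\n12 + 0 duplicates\n' | dup_count
```

If the two tools agree, the two duplicate counts printed by mark_both should match; the metrics files give a per-library breakdown.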