Linux Intro and NGS File Formats

NGS WTAC 2017

Tony Cox and Steven Leonard

WTSI

Shell, Running Commands

Shells provide convenient and almost universal remote access to Unix machines

Navigation and Manipulation of Directories

  • Relative and absolute directory paths
  • Manipulating directories and files
  • mkdir -p seq/data/results; ls -l seq 
    create a directory hierarchy
  • What did the -p switch do?
  • mv seq/data seq/olddata; ls -l seq 
    renaming a directory (or file)
  • rm -r seq; ls -l seq 
    removing an entire directory hierarchy (be careful!)
  • cp file1 file2; ls -l seq 
    copying a file
  • cp -r seq/data seq/data2; ls -l seq 
    copying a directory recursively
  • cp has many important and useful switches!
  • Separating commands with a semicolon runs them one after another from a single line, so you don't have to wait for each to finish before typing the next
  • External Drives - working with USB disks
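
The directory commands above can be tried safely in a scratch area. The following is a minimal sketch, assuming a GNU/Linux shell with mktemp available; the variable names workdir and listing are only for illustration and are not part of the course material.

```shell
# Work in a throwaway directory so that rm -r cannot touch real data
workdir=$(mktemp -d)
cd "$workdir"

mkdir -p seq/data/results      # -p also creates the missing parents seq and seq/data
mv seq/data seq/olddata        # rename a directory
cp -r seq/olddata seq/data2    # copy a directory and everything below it
listing=$(ls seq)              # remember what seq contains at this point
echo "$listing"

# A semicolon runs the second command whether or not the first one succeeded
rm -r seq; ls -l "$workdir"
```

Without -p the mkdir line would fail, because seq and seq/data do not exist yet.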

    Security and Permissions

    Useful Tips and Tricks

    Speeding things up

    Fasta Sequence Data Files

    Text File Manipulation

    Fasta Sequence Data Files

    Text File Manipulation Continued

    Fastq Sequence Data Files and Using the Shell

    Data Redirection

    Pipes

    Loops

    Compression

    Process Control

    SAM (& BAM) Sequence & Alignment Files

    SAM format

    Samtools - obtaining and building

    Samtools - using

    Samtools & CRAM

    Illumina Runfolder Anatomy

    So...

    • Sequence file formats
      • fasta, fastq, sam/bam : you're most likely to work with today
      • cram : smaller than bam, v3 now, next likely common format
      • sff, srf, sra, ztr : there are plenty of other formats which typically contain more "raw" data
      • Illumina: (c)locs, filter, control, bcl → fastq
    • Using the shell
      • is a "universal" language
      • gives you a history
      • can get the computer to do the (boring) repetitive stuff completely consistently
      • allows you to wrap an established procedure into a script
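
As an illustration of handing a repetitive step to the shell, here is a small sketch that counts the reads in every fastq file in a directory. The two tiny fastq files and the function name count_reads are made up for the example, and the count assumes four lines per record (no line-wrapped sequences).

```shell
# Create two toy fastq files in a scratch directory (illustrative data only)
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$workdir/a.fastq"
printf '@r1\nGGCC\n+\nIIII\n' > "$workdir/b.fastq"

# Number of reads = number of lines / 4, assuming unwrapped fastq records
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Loop over every fastq file and print its name and read count
for f in "$workdir"/*.fastq; do
    echo "$(basename "$f") $(count_reads "$f")"
done
```

Saving these lines in a file and running it with bash is one way to turn the procedure into a reusable script.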
    Questions?

    samtools & pileup

    • Converts a read-oriented (sam/bam) file in alignment order to a reference-oriented (text/binary) file
    • samtools-1.4/samtools faidx phix-illumina.fa
      create an index for reference fasta file
    • cat phix-illumina.fa.fai 
      it's quite boring for this reference
    • samtools-1.4/samtools mpileup -f phix-illumina.fa s6823_1_phix.bam | less -S
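
To make the idea of "read-oriented to reference-oriented" concrete, the sketch below builds a toy pileup with awk. It is not how samtools works internally: the three reads are invented, they are assumed to align without gaps, and only position, depth and the stacked bases are printed.

```shell
# Toy alignments: 1-based start position and read sequence (invented data)
reads='1 ACGT
3 GTAC
3 GTTC'

# For every reference position, collect the bases of all reads covering it
pileup=$(printf '%s\n' "$reads" | awk '
{
    for (i = 1; i <= length($2); i++) {
        pos = $1 + i - 1
        bases[pos] = bases[pos] substr($2, i, 1)
        if (pos > max) max = pos
    }
}
END {
    # One line per reference position: position, depth, stacked bases
    for (p = 1; p <= max; p++) print p, length(bases[p]), bases[p]
}')
echo "$pileup"
```

Each input line describes one read, while each output line describes one reference position, which is the reorientation that mpileup performs on real data.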

    samtools/bcftools & vcf/bcf

    • Download and uncompress/install bcftools
      • wget https://github.com/samtools/bcftools/releases/download/1.4/bcftools-1.4.tar.bz2
      • tar xjf bcftools-1.4.tar.bz2
      • cd bcftools-1.4; make; cd ..
    • Call variants and generate a vcf/bcf (text/binary) file
      • samtools-1.4/samtools mpileup -t DP,SP -I -uf phix-illumina.fa -Q 25 s6823_1_phix.bam \
           | bcftools-1.4/bcftools call -vc -

    Samtools & flagstat, stats/bamcheck

    • samtools-1.4/samtools view -u \
      ftp://ngs.sanger.ac.uk/production/WTAC/data/12585_1#21.bam \
      | samtools-1.4/samtools stats - | tee 12585_1#21.bam.bamstats
      pull the data from the ftp site, pipe it uncompressed through samtools stats (bamcheck), and use tee to write the output both to the terminal and to a file
    • samtools-1.4/misc/plot-bamstats -p 12585_1#21/ 12585_1#21.bam.bamstats
      create some plots using the bamcheck data
    • this doesn't work here because gnuplot isn't installed on these machines, so download the pre-made plots instead
      • wget ftp://ftp.sanger.ac.uk/pub/teams/117/WTAC/data/12585_1%2321.tar.gz
      • tar xvf 12585_1#21.tar.gz
    • then open file:///tmp/12585_1%2321/index.html in a browser
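
The tee step in the pipeline above can be tried without any sequence data. This is a minimal sketch in which printf stands in for the data source and grep -c for the statistics step; the file name count.txt and the variable names are hypothetical.

```shell
workdir=$(mktemp -d)

# tee copies its input to the terminal and, at the same time, to a file
printf 'read1\nread2\nread3\n' | grep -c read | tee "$workdir/count.txt"

# The file now holds the same text that was shown on the terminal
saved=$(cat "$workdir/count.txt")
echo "saved: $saved"
```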

    FastQC

    • From Babraham Institute - neighbours up the road...
    • Download and uncompress/install FastQC
      • wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
      • unzip fastqc_v0.11.5.zip
      • chmod 755 FastQC/fastqc
    • Try inspecting BAM or fastq files
      • FastQC/fastqc

    IGV

    • From Broad Institute
    • Download and uncompress/install igv
      • wget http://data.broadinstitute.org/igv/projects/downloads/IGV_2.3.92.zip
      • unzip IGV_2.3.92.zip
    • Inspect your BAM file with IGV
      • IGV_2.3.92/igv.sh

    Picard

    • From Broad Institute
    • Download and uncompress/install Picard
      • wget https://github.com/broadinstitute/picard/releases/download/2.9.0/picard.jar
    • N.B. this version requires Java 8
    • Try converting from BAM to fastq
      • java -jar picard.jar SamToFastq -h
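
To see what a sam-to-fastq conversion involves, the sketch below does it by hand with awk on two invented SAM records. It is only an illustration of the field mapping (QNAME, SEQ and QUAL are columns 1, 10 and 11); unlike a real converter it ignores the FLAG column, so reads aligned to the reverse strand would not be reverse-complemented.

```shell
# Two toy SAM alignment lines, tab-separated (invented data, no header)
sam=$(printf 'r1\t0\tphix\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tphix\t3\t60\t4M\t*\t0\t0\tGTAC\tHHHH\n')

# Skip header lines (they start with @) and print four fastq lines per read
fastq=$(printf '%s\n' "$sam" | awk -F'\t' '!/^@/ {
    print "@" $1    # read name
    print $10       # sequence
    print "+"
    print $11       # base qualities
}')
echo "$fastq"
```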

    Biobambam

    • Download and uncompress/install Biobambam
      • wget https://github.com/gt1/biobambam2/releases/download/2.0.72-release-20170316102450/biobambam2-2.0.72-release-20170316102450-x86_64-etch-linux-gnu.tar.gz
      • tar xvf biobambam2-2.0.72-release-20170316102450-x86_64-etch-linux-gnu.tar.gz
      • ln -s biobambam2-2.0.72-release-20170316102450-x86_64-etch-linux-gnu/bin biobambam
    • Try marking duplicates using Biobambam and compare with Picard
      • biobambam/bammarkduplicates -h