Linux Intro and NGS File Formats
NGS WTAC 2015
David Jackson
WTSI
Shell, Running Commands
- Open a terminal to get a shell
- Running commands
w
shows who is logged in and the machine load
ps
or ps x
or ps aux
to see processes
- Help for commands
man ps
or just web search....
- Different shells: tcsh, bash
- Leaving a shell: exit, ctrl-D
N.B. a shell gives easy and near-universal remote access to other machines
Navigation and Manipulation of Directories
ls -lart
to see files in the current directory. N.B. "." is the current directory and ".." is its parent
ls /usr/bin
or in another directory
ls /usr/bin/x*
or files matching a wildcard
pwd
to see current location
cd /usr/bin; pwd; cd; pwd; cd -; pwd; cd
move around hierarchy
mkdir -p super/dooper/results; ls -l super
creating a hierarchy
mv super/dooper super/duper; ls -l super
renaming a directory (or file)
rm -r super; ls -l super
removing a hierarchy
- A semicolon lets you put a series of commands on one line; the shell runs them one after another, so you do not have to wait for each to finish before typing the next
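As a rough sketch of how command separators behave, the following runs the directory commands from this slide as one semicolon-separated line inside a throwaway directory. The temporary directory from mktemp -d and the variable names are assumptions made for illustration; && is shown alongside because, unlike a semicolon, it stops the chain when a command fails.

```shell
# Work in a throwaway directory (an assumption for this sketch) so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# ';' queues commands on one line; the shell runs them strictly one after another.
mkdir -p super/dooper/results; mv super/dooper super/duper; listing=$(ls super)
echo "super now contains: $listing"

# '&&' also runs commands in order, but only continues while each one succeeds.
mkdir -p other && cd other && here=$(basename "$PWD")
echo "now in: $here"
```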
External Drives
- Plug in your USB drive
df -T -h
can help you find where it is "mounted"
- Change to that directory
- Make and move into a directory
- When finished, remember to move out of the directory (close the shell or change directory) and then tell the OS to "safely remove" (unmount) the USB drive
- Now move back to your home directory, then make and move into a directory for this tutorial
Fasta Sequence Data Files
and Text File Manipulation
wget ftp://ftp.sanger.ac.uk/pub/teams/117/WTAC/phix-illumina.fa
let's download a fasta sequence file for phiX (from the command line or just use your browser)
head -n 5 phix-illumina.fa
to inspect the first few lines
tail -n 5 phix-illumina.fa
to inspect the last few lines
wc phix-illumina.fa
to get the number of lines, words, and characters
Fasta Sequence Data Files
and Text File Manipulation
wget 'ftp://ftp.sanger.ac.uk/pub/teams/117/WTAC/adapters.fasta'
another fasta file with adapter sequence
cat adapters.fasta
spill the whole file to the terminal
less adapters.fasta
simple viewer for the terminal
grep -c -E '^>' adapters.fasta
count the number of sequence records
grep -A 2 RNA adapters.fasta
show RNA-related records (each matching line plus the two lines after it)
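A minimal sketch of the record-counting idea, using a tiny hand-made FASTA file rather than the downloaded one. The file name toy.fasta and its two records are invented for illustration.

```shell
# Create a tiny FASTA file (hypothetical records, not real adapter sequences).
workdir=$(mktemp -d)
cat > "$workdir/toy.fasta" <<'EOF'
>seq1 first toy record
ACGTACGT
ACGT
>seq2 second toy record
GGCC
EOF

# Each record starts with a '>' header line, so counting those counts records.
records=$(grep -c '^>' "$workdir/toy.fasta")

# Total bases: drop header lines, strip newlines, count the remaining characters.
bases=$(grep -v '^>' "$workdir/toy.fasta" | tr -d '\n' | wc -c)
bases=$((bases))   # normalise any padding some wc implementations add

echo "records: $records"
echo "bases:   $bases"
```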
Fastq Sequence Data Files
and Using the Shell
echo foo{1,3,5}{A,B} okay
prints the shell expanded strings
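To make the expansion concrete, here is a small sketch. bash is assumed, since brace expansion is not part of plain POSIX sh, and the variable names are illustrative only.

```shell
# Brace expansion happens before the command runs: every combination is generated,
# left to right, and words without braces are passed through unchanged.
expanded=$(echo foo{1,3,5}{A,B} okay)
echo "$expanded"

# The same mechanism generates the three fastq file names used on this slide.
names=$(echo s6823_1_{1,2,t}_phix.fastq)
echo "$names"
```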
wget ftp://ftp.sanger.ac.uk/pub/teams/117/WTAC/s6823_1_{1,2,t}_phix.fastq
download 3 fastq sequence files with phiX reads
head -n 6 s6823_1_{1,2,t}_phix.fastq
to inspect the first few lines of each file
wc s6823_1_*_phix.fastq
to get the number of lines, words, and characters
- How many sequence records?
- Phred qualities: Q = -10*log_10(P_error); beware of different encodings (ASCII offset 33 or 64)
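A short sketch of both bullets, using a hand-made two-record FASTQ file (the reads are invented). It assumes the common four-lines-per-record layout and the Sanger / Illumina 1.8+ encoding, where the ASCII code of the quality character minus 33 is the Phred score; older Illumina files used an offset of 64 instead.

```shell
workdir=$(mktemp -d)
cat > "$workdir/toy.fastq" <<'EOF'
@read1
ACGT
+
IIII
@read2
GGCC
+
I5+!
EOF

# Assuming four lines per record, records = lines / 4.
lines=$(wc -l < "$workdir/toy.fastq")
records=$((lines / 4))
echo "records: $records"

# Decode one quality character: ASCII code minus the offset (33 assumed here).
qchar='I'
ascii=$(printf '%d' "'$qchar")
phred=$((ascii - 33))

# Invert Q = -10*log_10(P_error) to recover the error probability.
p_error=$(awk -v q="$phred" 'BEGIN { printf "%.4f", 10 ^ (-q / 10) }')
echo "'$qchar' -> Q$phred, P_error = $p_error"
```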
Pipes and Data Redirection
md5sum s6823_1_1_phix.fastq > s6823_1_1_phix.fastq.md5
output of the md5sum program is put in a file
cat s6823_1_1_phix.fastq.md5
for f in *.fastq; do md5sum "$f" > "$f.md5"; done
loop through all the fastq files creating corresponding md5 files
cat s6823_1_{1,2,t}_phix.fastq | wc
output of one process can also be piped into another process
tar cf - s6823_1_{1,2,t}_phix.fastq | gzip -9 > s6823_1_phix.tgz
uses a pipe and a redirect to create a compressed archive of the fastq files (no intermediate uncompressed tar file is written, which reduces I/O)
gunzip -c s6823_1_phix.tgz | tar tf -
lists the contents of the archive
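The archive round trip can be tried on throwaway files. This sketch assumes a temporary directory and two invented file names; it also shows tar czf, which does the compression in one step on tar implementations that support the z option.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'one\n' > a.fastq
printf 'two\n' > b.fastq

# Pipe: tar writes the archive to stdout ('-'), gzip compresses it,
# and the redirect stores the result without an intermediate .tar file.
tar cf - a.fastq b.fastq | gzip -9 > toy.tgz

# Reverse direction: decompress to stdout and let tar list what it reads from stdin.
contents=$(gunzip -c toy.tgz | tar tf -)
echo "$contents"

# Equivalent single command where tar supports 'z' (GNU tar and bsdtar do).
tar czf toy2.tgz a.fastq b.fastq
contents2=$(tar tzf toy2.tgz)
```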
Process Control
xeyes
then ctrl-C to terminate it
xeyes
then ctrl-Z to suspend the process
bg
to allow it to continue running in the background
xeyes &
to start a process running in the background
jobs
to see background processes, jobs, of this shell
ps
shows processes of this session
kill %1
to terminate the first xeyes
kill PID
where PID is the process ID of the second xeyes (as shown by ps), to terminate the second
top
for a full terminal updating display of processes running on the system
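The same ideas work in a script, where there is no keyboard to press ctrl-C. A minimal sketch, using sleep as a stand-in for xeyes (an assumption, so that it runs without a display):

```shell
# Start a long-running process in the background; '$!' holds its PID.
sleep 30 &
pid=$!

# 'kill -0' sends no signal; it only tests whether the process exists.
if kill -0 "$pid" 2>/dev/null; then before="running"; else before="gone"; fi

# Terminate it by PID (the default signal is TERM), then reap it with 'wait'.
kill "$pid"
wait "$pid" 2>/dev/null || true

if kill -0 "$pid" 2>/dev/null; then after="running"; else after="gone"; fi
echo "before kill: $before, after kill: $after"
```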
SAM (& BAM) Sequence & Alignment Files
Samtools
obtaining and building
wget https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2 \
&& tar xjf samtools-1.2.tar.bz2
download the samtools source and, only if the download succeeds (&&), uncompress it
cd samtools-1.2; ls; make -j 2
ls; cd ..
samtools-1.2/samtools
samtools-1.2/samtools help view
man samtools-1.2/samtools.1
Samtools
using
samtools-1.2/samtools view -bS s6823_1_phix.sam > s6823_1_phix.bam
convert SAM to BAM
file s6823_1_phix.bam s6823_1_phix.sam
not normally interesting to look directly in binary files....
ls -lh s6823_1*
compare the sizes of the files....
samtools-1.2/samtools flagstat s6823_1_phix.bam
for some useful stats (dependent on how the file was created)
samtools-1.2/samtools index s6823_1_phix.bam
creating an index allows fast access by reference position (the BAM must be coordinate-sorted)
samtools-1.2/samtools tview s6823_1_phix.bam phix-illumina.fa
reference based text viewer -- boring for phiX
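Because SAM is plain tab-separated text, some of what flagstat reports can be approximated with standard tools. The sketch below uses a hand-written three-read SAM file (reference name, read names, positions and sequences are all invented) and counts mapped versus unmapped reads from bit 0x4 of the FLAG column. It is an illustration of the format, not a replacement for samtools.

```shell
workdir=$(mktemp -d)
sam="$workdir/toy.sam"

# Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL (tab-separated)
printf '@SQ\tSN:toy_ref\tLN:100\n'                          >  "$sam"
printf 'r1\t0\ttoy_ref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n'   >> "$sam"
printf 'r2\t16\ttoy_ref\t10\t60\t4M\t*\t0\t0\tGGCC\tIIII\n' >> "$sam"
printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n'           >> "$sam"

# Skip header lines (they start with '@'); FLAG is column 2.
# int(flag / 4) % 2 extracts bit 0x4, which is set for unmapped reads.
counts=$(awk -F '\t' '
    !/^@/ { if (int($2 / 4) % 2) unmapped++; else mapped++ }
    END   { printf "%d mapped, %d unmapped", mapped, unmapped }
' "$sam")
echo "$counts"
```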
Illumina Runfolder Anatomy
- Structure
- Data file types: (c)locs, filter, control, bcl, cif
So...
- Sequence file formats
- fasta, fastq, sam/bam : you're most likely to work with today
- cram : smaller than bam, v2.1 now, likely the next common format
- sff, srf, sra, ztr : there are plenty of other formats which typically contain more "raw" data
- Illumina: (c)locs, filter, control, bcl → fastq
- Using the shell
- is a "universal" language
- gives you a history
- can get the computer to do the (boring) repetitive stuff completely consistently
- allows you to wrap an established procedure into a script....
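As one possible illustration of wrapping a procedure in a script, the md5 loop from the redirection slide can be turned into a reusable function. The function name checksum_all and the toy files are hypothetical; md5sum from GNU coreutils and a bash-compatible shell are assumed.

```shell
# checksum_all DIR: write a .md5 file next to every *.fastq file in DIR,
# then verify each checksum. Returns non-zero if any verification fails.
checksum_all() {
    local dir=$1 f status=0
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue                  # no matches: the glob stays literal
        md5sum "$f" > "$f.md5"                   # record the checksum
        md5sum -c --quiet "$f.md5" || status=1   # re-read the file and compare
    done
    return "$status"
}

# Usage on throwaway files (invented for this sketch).
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/a.fastq"
printf '@r2\nGGCC\n+\nIIII\n' > "$workdir/b.fastq"
if checksum_all "$workdir"; then
    result="all checksums verified"
else
    result="verification failed"
fi
echo "$result"
```

Keeping the steps in a function means the same quoting and the same checks are applied to every file, every time it is run.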
Questions?
Samtools & flagstat, stats/bamcheck
samtools-1.2/samtools view -u \
ftp://ngs.sanger.ac.uk/scratch/project/WTAC/processed_data/12585_1#21.bam \
| samtools-1.2/samtools stats - | tee 12585_1#21.bam.bamstats
pull your data from the ftp site, push it uncompressed through samtools stats (formerly bamcheck), and write the output both to the terminal and to a file
samtools-1.2/misc/plot-bamstats -p 12585_1#21/ 12585_1#21.bam.bamstats
create some plots from the stats output
Picard
- Download and uncompress/install Picard
- Try converting from BAM to fastq
FastQC
- From Babraham Institute - neighbours up the road...
- Download and uncompress/install FastQC
- Try inspecting BAM or fastq files
IGV
- Inspect your BAM file with IGV
Biobambam
- Download and uncompress/install Biobambam
- Try marking duplicates using Biobambam and compare with Picard