Linux Intro and File Formats
NGS WTAC 2011
David Jackson
WTSI
Shell, Running Commands
- Open a terminal to get a shell
- Running commands
w
gives machine load
ps
or ps x
or ps aux
to see processes
- Help for commands
man ps
or just web search....
- Different shells:
tcsh
, bash
- Leaving a shell:
exit
, ctrl-D
N.B. the shell makes remote access easy and nearly universal
Navigation and Manipulation of Directories
ls -lart
to see files in the current directory. N.B. "." (the current directory) and ".." (its parent)
ls /usr/bin
or in another directory
ls /usr/bin/x*
or files matching a wildcard
pwd
to see current location
cd /usr/bin; pwd; cd; pwd; cd -; pwd; cd
move around hierarchy
mkdir -p super/dooper/results; ls -l super
creating a hierarchy
mv super/dooper super/duper; ls -l super
renaming a directory (or file)
rm -r super; ls -l super
removing a hierarchy
- Separating commands with a semi-colon lets you enter a series of commands on one line; they run one after another, so you do not have to wait for each to complete before typing the next
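A minimal sketch of how a semi-colon-separated line runs, using only `echo`, `sleep` and `true` as stand-in commands (they are illustrative, not part of the tutorial data). It also shows `&&`, which is used later in these slides to run a second command only if the first succeeds.

```shell
# Commands separated by ';' run strictly in order: each one starts
# only after the previous one has finished.
order=$(echo first; sleep 1; echo second; echo third)
echo "$order"

# '&&' also runs commands in order, but stops at the first failure.
# Here the first command succeeds, so the second one runs too.
chained=$(true && echo "ran after success")
echo "$chained"
```

With `;` the later commands run even if an earlier one fails, which is why `&&` is the safer choice when a step depends on the one before it.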
External Drives
- Plug in your USB drive
df -T -h
can help you find where it is "mounted"
- Change to that directory
- Make and move into a directory for this tutorial
- When finished, remember to move out of the directory (close the shell or change directory) and then tell the OS to "safely remove" (unmount) the USB drive
Fasta Sequence Data Files
and Text File Manipulation
Fastq Sequence Data Files
and Using the Shell
echo foo{1,3,5}{A,B} okay
prints the shell expanded strings
wget ftp://ftp.sanger.ac.uk/pub/team117/WTAC/6823_1_{1,2,t}_phix.fastq
download 3 fastq sequence files with phiX reads
head -n 6 6823_1_{1,2,t}_phix.fastq
to inspect the first few lines of each file
wc 6823_1_*_phix.fastq
to get the number of lines, words, and characters
- How many sequence records?
- Phred qualities: Q = -10*log_10(P_error). Beware of different encodings
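A rough sketch of both points, on a made-up two-record file (`toy.fastq` is a hypothetical name, not one of the downloaded files). It assumes the common layout of exactly four lines per record and, for the last step, a Phred+33 ASCII encoding; other encodings use a different offset.

```shell
# Build a tiny two-record FASTQ file in a scratch directory (made-up data)
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!5I\n' > "$tmp/toy.fastq"

# Assuming 4 lines per record: records = lines / 4
lines=$(wc -l < "$tmp/toy.fastq")
records=$((lines / 4))
echo "records: $records"

# Phred Q to error probability: P_error = 10^(-Q/10), here for Q = 20
p_error=$(awk 'BEGIN { printf "%.3f", 10 ^ (-20 / 10) }')
echo "Q20 -> P_error = $p_error"

# Decode one quality character, ASSUMING Phred+33: Q = ASCII code - 33
q=$(( $(printf '%d' "'I") - 33 ))
echo "quality character I -> Q = $q"
```

The same line count divided by four answers "how many sequence records?" for the phiX files, provided their sequences are not wrapped over several lines.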
Pipes and Data Redirection
md5sum 6823_1_1_phix.fastq > 6823_1_1_phix.fastq.md5
output of the md5sum program is put in a file
cat 6823_1_1_phix.fastq.md5
for f in *.fastq; do md5sum "$f" > "$f.md5"; done
loop through all the fastq files creating corresponding md5 files
cat 6823_1_{1,2,t}_phix.fastq | wc
output of a process can also be sent to another process with a pipe (|)
tar cf - 6823_1_{1,2,t}_phix.fastq | gzip -9 > 6823_1_phix.tgz
uses a pipe and a redirect to create a compressed archive of the fastq files (reduced IO)
gunzip -c 6823_1_phix.tgz | tar tf -
lists the contents of the archive
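The redirects and pipes above can be tried on throwaway files first. The sketch below uses hypothetical names (`a.txt`, `b.txt`, `demo.tgz`) in a scratch directory, and adds one step not on the slide: `md5sum -c`, which re-computes the checksums and compares them with the stored ones.

```shell
# Work in a scratch directory so nothing real is touched
work=$(mktemp -d)
cd "$work"
printf 'hello\n' > a.txt
printf 'world\n' > b.txt

# Redirect: write one checksum file per input file
for f in *.txt; do md5sum "$f" > "$f.md5"; done

# Verify: -c reads the stored checksums and checks the files against them
check=$(md5sum -c a.txt.md5 b.txt.md5)
echo "$check"

# Pipe plus redirect: archive to stdout, compress, write to a file
tar cf - a.txt b.txt | gzip -9 > demo.tgz

# Pipe: decompress to stdout and list the archive contents
listing=$(gunzip -c demo.tgz | tar tf -)
echo "$listing"
```

Quoting `"$f"` keeps the loop working for file names that contain spaces.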
Process Control
xeyes
then ctrl-C to terminate it
xeyes
then
- ctrl-Z to suspend the process
bg
to allow it to continue running in the background
xeyes &
to start a process running in the background
jobs
to see background processes, jobs, of this shell
ps
shows processes of this session
kill %1
to terminate the first xeyes
kill <2nd xeyes PID>
for the second
top
for a full terminal updating display of processes running on the system
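The background and kill steps can also be scripted. This sketch uses `sleep 300` as a stand-in for xeyes (an assumption, so that it runs without a graphical display) and the shell variable `$!`, which holds the PID of the most recent background process.

```shell
# Start a long-running process in the background (stand-in for xeyes)
sleep 300 &
pid=$!
echo "started background process with PID $pid"

# kill -0 sends no signal; it only checks that the process exists
kill -0 "$pid" && echo "process $pid is running"

# Terminate it by PID (the default signal is TERM), then collect its status
kill "$pid"
status=0
wait "$pid" 2>/dev/null || status=$?

# A process ended by a signal reports 128 + signal number (TERM is 15)
echo "exit status after kill: $status"
```

In an interactive shell, `kill %1` does the same thing using the job number shown by `jobs` instead of the PID.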
SAM (& BAM) Sequence & Alignment Files
Samtools
wget ftp://ftp.sanger.ac.uk/pub/team117/WTAC/samtools.zip && unzip samtools.zip
download and, if the download succeeds, unzip some samtools binaries (that Thomas Keane has prepared for these Ubuntu PCs)
samtools/samtools view -b -T phix-illumina.fa 6823_1_phix.sam > new.bam
convert SAM to BAM
file new.bam 6823_1_phix.sam
don't stare into the sun, don't look directly at binary files....
ls -lh 6823_1* new.bam
compare the sizes of the files....
samtools/samtools flagstat new.bam
for some useful stats (dependent on how it's been created)
samtools/samtools index 6823_1_phix.bam
creating an index allows fast reference location access
samtools/samtools tview 6823_1_phix.bam phix-illumina.fa
reference based text viewer -- boring for phiX
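Because SAM is tab-separated text, simple counts of the kind flagstat reports can be approximated with standard tools. The sketch below is not how samtools itself works; it uses a made-up three-read file (`toy.sam`, reference name `ref`) and relies on two facts of the SAM format: header lines start with "@", and bit 0x4 of the FLAG column (column 2) marks an unmapped read.

```shell
# Build a tiny SAM file: one header line and three reads (made-up data)
tmp=$(mktemp -d)
printf '@SQ\tSN:ref\tLN:100\n'                      >  "$tmp/toy.sam"
printf 'r1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n'  >> "$tmp/toy.sam"
printf 'r2\t16\tref\t5\t60\t4M\t*\t0\t0\tTTGA\tIIII\n' >> "$tmp/toy.sam"
printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII\n'      >> "$tmp/toy.sam"

# Skip header lines; a read is mapped when FLAG bit 0x4 is not set.
# int(flag / 4) % 2 extracts that bit without needing bitwise operators.
counts=$(awk -F '\t' '
    !/^@/ { total++; if (int($2 / 4) % 2 == 0) mapped++ }
    END   { print total, mapped }' "$tmp/toy.sam")
echo "total and mapped reads: $counts"
```

This only works on the text SAM form; a BAM file would first need converting back with samtools view.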
So...
- Sequence file formats
- fasta, fastq, sam/bam : you're most likely to work with
- sff, srf, sra, ztr : there are plenty of other formats which typically contain more "raw" data
- Using the shell
- is a "universal" language
- gives you a
history
- can get the computer to do the (boring) repetitive stuff completely consistently
- allows you to wrap an established procedure into a script....
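As an illustration of wrapping a procedure so it can be repeated consistently, here is a small sketch of a shell function. The name `count_records` and the toy files are hypothetical, and it assumes FASTQ files with exactly four lines per record.

```shell
# count_records: print "<file> <number of records>" for each FASTQ file given
count_records() {
    for f in "$@"; do
        lines=$(wc -l < "$f")
        # Assumes 4 lines per FASTQ record
        echo "$f $((lines / 4))"
    done
}

# Try it on two made-up files in a scratch directory
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/one.fastq"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmp/two.fastq"

report=$(count_records "$tmp/one.fastq" "$tmp/two.fastq")
echo "$report"
```

Saved in a file and made executable, the same function body becomes a script that can be rerun on every new batch of files.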
Questions?
Picard
- Download and uncompress/install Java
- Download and uncompress/install Picard
- Try converting from BAM to fastq