NGS File Formats
NGS WTAC 2012
David Jackson
WTSI
Fasta Sequence Data Files
and Text File Manipulation
Fastq Sequence Data Files
and Using the Shell
wget ftp://ftp.sanger.ac.uk/pub/team117/WTAC/6823_1_{1,2,t}_phix.fastq
download 3 fastq sequence files with phiX reads (the shell expands {1,2,t} into three separate URLs)
head -n 6 6823_1_{1,2,t}_phix.fastq
to inspect the first few lines of each file
wc 6823_1_*_phix.fastq
to get the number of lines, words, and bytes in each file
- How many sequence records?
- Phred qualities: Q = -10*log_10(P_error); beware of different encodings (e.g. ASCII offset 33 vs 64)
- Compress to reduce size:
cat 6823_1_1_phix.fastq | gzip -9 > 6823_1_1_phix.fastq.gz
then compare the sizes with ls -l 6823_1_1_phix.fastq*
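A minimal sketch for the record-count and Phred questions above. It assumes unwrapped FASTQ (exactly 4 lines per record) and Phred+33 encoding; the file demo.fastq and its two reads are made up for illustration.

```shell
# Build a tiny two-record FASTQ file (made-up reads, illustration only).
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!5?I\n' > "$tmp/demo.fastq"

# Assumption: each record is exactly 4 lines, so records = lines / 4.
lines=$(wc -l < "$tmp/demo.fastq")
records=$((lines / 4))
echo "records: $records"

# Decode the quality line of the second record (line 8 of the file).
# Assumption: Phred+33 encoding, so score = ASCII code - 33.
qual=$(sed -n '8p' "$tmp/demo.fastq")
scores=$(printf '%s' "$qual" | od -An -v -tu1 |
  awk '{ for (i = 1; i <= NF; i++) printf "%s%d", (n++ ? " " : ""), $i - 33 } END { print "" }')
echo "phred: $scores"

rm -r "$tmp"
```

For older Illumina files that use offset 64, the same sketch would subtract 64 instead of 33.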
Illumina Runfolder Anatomy
- Structure
- File types: (c)locs, filter, control, bcl, cif
SAM (& BAM) Sequence & Alignment Files
Samtools
wget ftp://ftp.sanger.ac.uk/pub/team117/WTAC/samtools.zip && unzip samtools.zip
download samtools.zip and, only if the download succeeds, unzip the samtools binaries (prepared by Thomas Keane for these Ubuntu PCs)
samtools/samtools view -bS 6823_1_phix.sam > 6823_1_phix.bam
convert SAM to BAM
file 6823_1_phix.bam 6823_1_phix.sam
don't stare into the sun, don't look directly at binary files....
ls -lh 6823_1*
compare the sizes of the files....
samtools/samtools flagstat 6823_1_phix.bam
for some useful stats (which ones are meaningful depends on how the file was created)
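To give a feel for where such stats come from, here is a rough sketch that tallies reads from the FLAG column of SAM text. It is not how samtools implements flagstat, which reports many more categories. The file demo.sam and its three alignments are made up; the bit meanings (0x4 unmapped, 0x2 properly paired) follow the SAM specification.

```shell
# Three made-up SAM alignment lines (tab-separated); FLAG is column 2.
tmp=$(mktemp -d)
{
  printf 'r1\t99\tphiX\t1\t60\t4M\t=\t101\t104\tACGT\tIIII\n'
  printf 'r1\t147\tphiX\t101\t60\t4M\t=\t1\t-104\tACGT\tIIII\n'
  printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
} > "$tmp/demo.sam"

# Test single FLAG bits with integer division, since portable awk has no bitwise AND.
# Bit 0x4 set   = read is unmapped.
# Bit 0x2 set   = read is mapped in a proper pair.
summary=$(awk -F '\t' '
  /^@/ { next }                              # skip header lines
  {
    total++
    if (int($2 / 4) % 2 == 0) mapped++
    if (int($2 / 2) % 2 == 1) proper++
  }
  END { print total + 0, mapped + 0, proper + 0 }
' "$tmp/demo.sam")
echo "total mapped properly_paired: $summary"

rm -r "$tmp"
```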
samtools/samtools index 6823_1_phix.bam
creating an index allows fast access to the reads at a given reference position
samtools/samtools tview 6823_1_phix.bam phix-illumina.fa
reference based text viewer -- boring for phiX
So...
- Sequence file formats
- fasta(.gz), fastq(.gz), sam/bam : the formats you're most likely to work with
- cram : smaller than bam, v1.0 now?
- sff, srf, sra, ztr : there are plenty of other formats which typically contain more "raw" data
Questions?
Picard
- Download and uncompress/install Java
- Download and uncompress/install Picard
- Try converting from BAM to fastq
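As a rough idea of what a BAM-to-fastq conversion has to do, here is a simplified sketch in plain shell on SAM text. It is not Picard and not a replacement for it: it ignores read pairing and secondary alignments. The file demo.sam and its reads are made up. The key step is that reads aligned to the reverse strand (FLAG bit 0x10) are stored reverse-complemented, so they are flipped back to the original read orientation.

```shell
# Made-up SAM text: one header line, one forward read, one reverse-strand read.
tmp=$(mktemp -d)
{
  printf '@HD\tVN:1.0\tSO:unsorted\n'
  printf 'fwd\t0\tphiX\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'rev\t16\tphiX\t5\t60\t4M\t*\t0\t0\tACGG\tABCD\n'
} > "$tmp/demo.sam"

tab=$(printf '\t')
fastq=$(while IFS="$tab" read -r name flag rname pos mapq cigar rnext pnext tlen seq qual rest; do
  case $name in @*) continue ;; esac          # skip SAM header lines
  if [ $(( flag & 16 )) -ne 0 ]; then
    # Reverse-strand alignment: restore the original read orientation.
    seq=$(printf '%s\n' "$seq" | rev | tr 'ACGTacgt' 'TGCAtgca')
    qual=$(printf '%s\n' "$qual" | rev)
  fi
  printf '@%s\n%s\n+\n%s\n' "$name" "$seq" "$qual"
done < "$tmp/demo.sam")
printf '%s\n' "$fastq"

rm -r "$tmp"
```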