Welcome to SSAHA: Sequence Search and Alignment by Hashing Algorithm

Copyright (C) 2003 by Genome Research Limited

This software is released under the terms of version 2 of the GNU General
Public Licence, as published by the Free Software Foundation.

This is SSAHA Version 3.1, released 16th April 2003.

NAME
    ssaha - performs rapid searching of DNA and protein databases

SYNOPSIS
    (i)   ssaha [-h] [-help]
              - print this help message.

    (ii)  ssaha queryFile subjectFile
              [ -optionName_1 [optionValue_1] ] ...
              [ -optionName_n [optionValue_n] ]
              - create a hash table from the sequences in subjectFile and
                use it to search subjectFile for the sequences in queryFile.

    (iii) ssaha subjectFile
              [ -optionName_1 [optionValue_1] ] ...
              [ -optionName_n [optionValue_n] ]
              - just create a hash table from the sequences in subjectFile.
                The -saveName option must be set (see OPTIONS).

DESCRIPTION
    ssaha is a tool for rapidly finding near exact matches in DNA or
    protein databases. The name is an acronym standing for Sequence Search
    and Alignment by Hashing Algorithm. It works by converting a sequence
    database into a hash table. This is then rapidly quizzed for hits,
    which are concatenated into matches.

OPTIONS
    Options may be specified by either their full names or short names and
    may appear on the command line in any order.

    Full Name  Short
        Description

    -queryFormat  -qf
        Acceptable values:
            fasta - fasta file
            fastq - fastq file
        Default value: If not specified, attempts to deduce file type
        based on the filename suffix as follows:
            File suffix     Deduced file type
            .fasta, .fa     fasta file
            .fastq          fastq file
            else            assumed to be a directory of files, each of
                            whose names indicates the file type as
                            specified by the above rules.

    -subjectFormat  -sf
        Acceptable values:
            fasta - fasta file
            fastq - fastq file
            hash  - precomputed hash table
        Default value: If not specified, attempts to deduce file type
        based on the filename suffix as follows:
            File suffix     Deduced file type
            .fasta, .fa     fasta file
            .fastq          fastq file
            else            assumed to be a directory of files, each of
                            whose names indicates the file type as
                            specified by the above rules.
        Note: If -sf is set to hash, the -wl, -sl, and -ph options will,
        if present, be ignored.

    -queryType  -qt
        Acceptable values:
            DNA
            protein
        Default value: DNA

    -subjectType  -st
        Acceptable values:
            DNA     (in which case queryType must also be DNA)
            protein
            codon   (i.e. do 6 way DNA to protein translation)
        Default value: DNA

    -hashStats  -hs
        Show information about the hash table currently in use.

    -parserFriendly  -pf
        Show one match per line as a set of tab delimited (a.k.a.
        perlFriendly) fields:
            match direction: F forward, R reverse
            query name
            query start
            query end
            subject name
            subject start
            subject end
            number of matching bases
            percentage identity

    -logMode  -lm
        Controls the output of log information.
        Acceptable values:
            cerr - send to standard error
            cout - send to standard output
            null - suppress log output
            any other value sends log information to a file of the
            same name
        Default value: cerr

    -packHits  -ph
        Store position of each word in a "packed" format comprising 32
        bits per word. This halves the size of the .body file at the
        expense of a slight decrease in search speed.

    -wordLength  -wl
        Size in base pairs of the words used to form the hash table. May
        vary from 1 to (assuming sufficient RAM is available) 16.
        Default value is 10.

    -maxGap  -mg
        Maximum gap allowed between successive hits for them to count as
        part of the same match.
        Default value is 0.

    -maxInsert  -mi
        Maximum number of insertions/deletions allowed between successive
        hits for them to count as part of the same match.
        Default value is 0.

    -maxStore  -ms
        Largest number of times that a word may occur in the hash table
        for it to be used for matching, expressed as a multiple of the
        number of occurrences per word that would be expected for a
        random database of the same size as the subject database.
        Default value is 10000.

    -numRepeats  -nr
        Maximum size of tandem repeating motif that can be detected in
        the query sequence. This option may produce faster and better
        matches when dealing with data containing tandem repeats.
        Defaults to 0, and must be less than or equal to the word length.
        Notes:
        1. This option does nothing if -ph is also set.
        2. To get the best results with this option, set -mg to be at
           least equal to the word length. Setting the -mi option may
           also help.

    -minPrint  -mp
        The minimum number of matching bases or residues that must be
        found in the query and subject sequences before they are
        considered as a match and thus printed.
        Default value is 1.

    -queryStart  -qs
        Specifies the number of the first query sequence to be matched
        with the subject sequences (numbering of both the query and
        subject sequences starts at 1).
        Default value is 1.

    -queryEnd  -qe
        Specifies the number of the last query sequence to be matched
        with the subject sequences. If not specified, continues until the
        end of the query sequence data is reached.

    -reportMode  -rm
        Specifies behaviour upon encountering unexpected alphanumeric
        characters in query or subject sequences:
            ignore   - do nothing
            report   - report to standard error
            replaceA - silently replace character with 'A'
            replaceG, replaceC, replaceT - as for replaceA
            rrepA    - replace character with 'A' and report
            rrepG, rrepC, rrepT - as for rrepA
        Default value is `ignore'.
        NB FOR VERSION 3.0, THE -reportMode OPTION HAS BEEN SUPERSEDED BY
        THE -queryReplace AND -subjectReplace OPTIONS - SEE THE
        APPROPRIATE HELP ENTRIES

    -reverseQuery  -rq
        When matching the reverse strand of a query, convert the
        positions of any matches found into the coordinate frame of the
        forward strand.
        Has no effect if queryType is set to protein.

    -saveName  -sn
        Specifies that the hash table must be saved before the program
        exits. This option must be followed by a string fileNameRoot.
        The hash table data is saved into the files
            fileNameRoot.head
            fileNameRoot.body
            fileNameRoot.name
            fileNameRoot.size
        Notes:
        1. If no query file is specified (usage (iii) above) it is an
           error not to set this option.
        2. It is an error to set this option if subjectFormat is set to
           `hash'.
        3. If the -ph option is also set, the -sn option also produces a
           fileNameRoot.start file.

    -sortMatches  -sm
        Output only the top n matches for each query, sorted by number of
        matching bases, then by subject name, then by start position in
        the query sequence.
        Default value is zero, which outputs all matches for each query
        and does no sorting.

    -stepLength  -sl
        Number of base pairs gap between words used to produce hash
        table. Ignored if a precomputed hash table is being used.
        Default value is equal to wordLength.

    -queryReplace  -qr
        Specifies behaviour upon encountering unexpected alphanumeric
        characters in query sequences:
            ignore    - do nothing
            report    - report to standard error
            A,G, etc. - replace with that character: must be A, G, C, T
                        for DNA, or a valid IUPAC amino acid code for
                        protein.
        Default: replace with 'A' for DNA, 'X' for protein

    -subjectReplace  -sr
        Specifies behaviour upon encountering unexpected alphanumeric
        characters in subject sequences:
            ignore    - do nothing
            report    - report to standard error
            A,G, etc. - replace with that character: must be A, G, C, T
                        for DNA, or a valid IUPAC amino acid code for
                        protein.
            tag       - `tag' the word so that it is not put into the
                        hash table.
        Default: tag

    -substituteWords  -sw
        Look for single base/amino mismatches in words that occur less
        than this many times more often than would be expected for a
        random database of the same size as the subject database.
        Only looks for:
            purine (G-A)/pyrimidine (T-C) mismatches for DNA
            mismatches with positive BLOSUM score for protein
        Set to zero to switch this feature off.
        Default value: 0 (switched off)

    -doAlignment  -da
        Produce a graphical alignment of the matching region using banded
        dynamic programming. The alignment will be formatted to the
        specified number of columns. Set to zero to suppress alignments,
        otherwise must be at least 20.
        Default value: 80

    -bandExtension  -be
        Specify size of the band to use for banded dynamic programming,
        when producing a graphical alignment.
            0 - diagonal only
            n - n cells each side of diagonal
        Only has an effect when -da is nonzero.
        Default value: 0 (diagonal only)

OUTPUT FORMAT
    When full alignments are requested (-da set to nonzero) the software
    produces a line of information then a graphical alignment for each
    match found:

    i) DNA against DNA (untranslated)

    RF p1_1a788a06.q1c 515 538 p1_1a788f11.p1c 493 516 24 100.00
    Alignment score: 13
    Q:000000515 tttt-tgagacggagtctcgctct
                ||||x||||||xx|||||||||||
    S:000000493 ttttttgagacaaagtctcgctct

    ii) Protein against protein

    FF SW:PPSA_AERPE 625 642 SW:PPSA_METTH 577 592 8 44.44
    Alignment score: 31
    Q:000000625 KGGEKYETLDERNPMIGW
                x|||x |xx |x|||x||
    S:000000577 EGGEN-EPY-EHNPMLGW

    iii) Protein query against translated DNA subject (format for
         translated DNA query against protein subject is similar)

    FR SW:PPS2_HUMAN 467 473 p1_1a788c10.q1c 507 529 7 100.00
    Alignment score: 22
    Q:000000467 E..E..G..--V..L..D..P..
                |||||||||  ||||||Nxx|||
    S:000000507 gaggagggcatgtattaaaccca

    iv) DNA against DNA (translated)

    FF p1_1a788a01.p1c 52 76 p1_1a788b11.p1c 314 336 24 96.00
    Alignment score: 3
    Q:000000052 atggtatgtctttcttttact-agat-
                ||xV..C..L..S..F..    R..
    S:000000314 attgtttgtctctccttc---taga-a

    From left to right the fields in the match information line are as
    follows:
        i)    First character: query direction (F forward, R reverse)
              Second character: subject direction (F forward, R reverse)
        ii)   query name
        iii)  query start
        iv)   query end
        v)    subject name
        vi)   subject start
        vii)  subject end
        viii) estimated number of matching bases
        ix)   estimated percentage identity
    The last two quantities are not exact values; they are approximations
    used to order the matches (if requested) before the full alignment is
    done.

    With the alignments switched off (-da 0), an entry like the one below
    is produced for each sequence in the query.

    Matches For Query 6 (653 bases): p1_1a788a03.q1c
    F 6 : p1_1a788a03.q1c Bases: 650 Q: 1 to 650 S: 1 to 650 100.00%
    R 5 : p1_1a788a03.p1c Bases: 100 Q: 22 to 121 S: 501 to 600 100.00%

    The top line shows the query number, name and size of the query
    sequence. Below that is one line for each match found in the subject
    database. From left to right, the entries on these lines are as
    follows:
        match direction: F forward, R reverse
        subject number
        subject name
        number of matching bases
        query start
        query end
        subject start
        subject end
        percentage identity

    Notes:
    1. The output format is different if the program is run with the
       -parserFriendly option set. See the description of that option for
       details.
    2. Because SSAHA works by looking for whole-word matches, the `number
       of matching bases' and `percentage identity' fields must be
       considered as lower bounds on the true values of these quantities.

FURTHER INFORMATION
    The SSAHA home page is at
    http://www.sanger.ac.uk/Software/analysis/SSAHA/

    Zemin Ning, Anthony J. Cox and James C. Mullikin. SSAHA: A Fast
    Search Method for Large DNA Databases. Submitted to Genome Research.
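
EXAMPLE: READING -parserFriendly OUTPUT
    The -parserFriendly option is described as printing one match per
    line as nine tab delimited fields. The Python sketch below shows one
    way a script might read such a line. It is not part of SSAHA: the
    names Match and parse_pf_line are hypothetical, the sample line was
    assembled by hand from values in the OUTPUT FORMAT examples rather
    than captured from a real run, and the handling of a trailing '%' on
    the last field is an assumption, since the exact form of that field
    is not stated above.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """One match, with fields in the order listed under -parserFriendly."""
    direction: str          # "F" forward or "R" reverse
    query_name: str
    query_start: int
    query_end: int
    subject_name: str
    subject_start: int
    subject_end: int
    matching_bases: int
    percent_identity: float


def parse_pf_line(line: str) -> Match:
    """Parse one tab delimited line (hypothetical helper, not an SSAHA API)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab delimited fields, got {len(fields)}")
    direction, q_name, q_start, q_end, s_name, s_start, s_end, bases, ident = fields
    return Match(
        direction=direction,
        query_name=q_name,
        query_start=int(q_start),
        query_end=int(q_end),
        subject_name=s_name,
        subject_start=int(s_start),
        subject_end=int(s_end),
        matching_bases=int(bases),
        # Assumption: tolerate a trailing '%' in case the field carries one.
        percent_identity=float(ident.rstrip("%")),
    )


if __name__ == "__main__":
    # Hand-built sample line, not real program output.
    sample = "R\tp1_1a788a03.q1c\t22\t121\tp1_1a788a03.p1c\t501\t600\t100\t100.00"
    match = parse_pf_line(sample)
    print(match.subject_name, match.matching_bases, match.percent_identity)
```

    A named tuple keeps the fields in the documented order while letting
    later code refer to them by name. Since the number of matching bases
    and the percentage identity are described above as lower bounds,
    scripts that filter on them should treat them as such.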