Welcome to SSAHA: Sequence Search and Alignment by Hashing Algorithm
Copyright (C) 2003 by Genome Research Limited
This software is released under the terms of version 2 of the GNU General
Public Licence, as published by the Free Software Foundation.
This is SSAHA Version 3.1, released 16th April 2003.

NAME

 ssaha - performs rapid searching of DNA and protein databases

SYNOPSIS

 ssaha [-h] [-help]

 - print this help message.

 ssaha queryFile subjectFile [ -optionName_1 [optionValue_1] ] ...
   [ -optionName_n [optionValue_n] ] 

 - create a hash table from the sequences in subjectFile and
   use it to search subjectFile for the sequences in queryFile.

 ssaha subjectFile [ -optionName_1 [optionValue_1] ] ...
   [ -optionName_n [optionValue_n] ]

 - just create a hash table from the sequences in subjectFile.
   The -saveName option must be set (see OPTIONS).

DESCRIPTION

 ssaha is a tool for rapidly finding near-exact matches in DNA or protein
 databases. The name is an acronym standing for Sequence Search and Alignment
 by Hashing Algorithm. It works by converting a sequence database into a
 hash table. This table is then rapidly searched for hits, which are
 concatenated into matches.
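
 The following Python fragment is a minimal sketch of this idea and is not
 the SSAHA implementation. The function names build_hash_table and find_hits
 are hypothetical, positions are zero-based, and the sampling scheme (subject
 words taken every step_length bases, query words taken at every position) is
 an assumption made for illustration. Concatenating hits into matches is not
 shown.

```python
from collections import defaultdict


def build_hash_table(subjects, word_length=10, step_length=None):
    """Map each word to the (subject index, position) pairs where it starts."""
    if step_length is None:
        # Assumed to mirror the documented default: step equals word length.
        step_length = word_length
    table = defaultdict(list)
    for s_idx, seq in enumerate(subjects):
        for pos in range(0, len(seq) - word_length + 1, step_length):
            table[seq[pos:pos + word_length]].append((s_idx, pos))
    return table


def find_hits(table, query, word_length=10):
    """Return (query position, subject index, subject position) per shared word."""
    hits = []
    for q_pos in range(len(query) - word_length + 1):
        word = query[q_pos:q_pos + word_length]
        # .get avoids creating empty entries in the defaultdict.
        for s_idx, s_pos in table.get(word, []):
            hits.append((q_pos, s_idx, s_pos))
    return hits


# Toy data with a short word length so the result is easy to follow.
table = build_hash_table(["ACGTACGTTTGA"], word_length=4)
hits = find_hits(table, "GGACGTTT", word_length=4)
```

 In this toy case the query word ACGT at query position 2 hits the subject
 at positions 0 and 4.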

OPTIONS

 Options may be specified by either their full names or short names and may
 appear on the command line in any order.

 Full Name      Short      Description

-queryFormat   -qf        Acceptable values:
                          fasta - fasta file
                          fastq - fastq file

                          Default value:
                          If not specified, attempts to deduce file type
                          based on the filename suffix as follows:

                          File suffix    Deduced file type
                          .fasta, .fa    fasta file
                          .fastq         fastq file
                          else assumed to be a directory of files, each of
                          whose names indicates the file type as specified
                          by the above rules.

-subjectFormat -sf        Acceptable values:
                          fasta - fasta file
                          fastq - fastq file
                          hash  - precomputed hash table

                          Default value:
                          If not specified, attempts to deduce file type
                          based on the filename suffix as follows:

                          File suffix    Deduced file type
                          .fasta, .fa    fasta file
                          .fastq         fastq file
                          else assumed to be a directory of files, each of
                          whose names indicates the file type as specified
                          by the above rules.

                          Note:
                          If -sf is set to hash, the -wl, -sl, and -ph
                          options will, if present, be ignored.

-queryType     -qt        Acceptable values:
                          DNA
                          protein

                          Default value:
                          DNA

-subjectType   -st        Acceptable values:
                          DNA (in which case queryType must also be DNA)
                          protein
                          codon (i.e. do 6 way DNA to protein translation)

                          Default value:
                          DNA
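
                          Example:
                          A minimal Python sketch of a 6 way translation,
                          not SSAHA's own code. The standard genetic code
                          is assumed, all names are hypothetical, and
                          codons containing other characters become X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict keyed by (strand, offset) holding each translated frame."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("F", dna), ("R", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCC")
```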

-hashStats     -hs        Show information about the hash table currently
                          in use.

-parserFriendly -pf       Show one match per line as a set of tab delimited
 (a.k.a. perlFriendly)    fields:

                          match direction: F forward, R reverse
                          query name
                          query start
                          query end
                          subject name
                          subject start
                          subject end
                          number of matching bases
                          percentage identity
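
                          Example:
                          A hedged Python sketch of reading one such line.
                          The names ParserFriendlyMatch and parse_match_line
                          are hypothetical, and the sample line is invented
                          for illustration. Numeric fields are assumed to
                          be plain numbers.

```python
from typing import NamedTuple


class ParserFriendlyMatch(NamedTuple):
    direction: str
    query_name: str
    query_start: int
    query_end: int
    subject_name: str
    subject_start: int
    subject_end: int
    matching_bases: int
    percent_identity: float


def parse_match_line(line):
    """Split one tab-delimited match line into the nine documented fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited fields, got {len(fields)}")
    d, qn, qs, qe, sn, ss, se, nb, pid = fields
    return ParserFriendlyMatch(
        d, qn, int(qs), int(qe), sn, int(ss), int(se), int(nb), float(pid)
    )


# Invented sample line, not real program output.
example = "F\tp1_1a788a03.q1c\t1\t650\tp1_1a788a03.q1c\t1\t650\t650\t100.00"
match = parse_match_line(example)
```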

-logMode       -lm        Controls the output of log information
                          Acceptable values:
                          cerr - send to standard error
                          cout - send to standard output
                          null - suppress log output
                          any other value sends log information to a file
                          of the same name

                          Default value:
                          cerr

-packHits      -ph        Store position of each word in a "packed"
                          format comprising 32 bits per word. This halves
                          the size of the .body file at the expense of a
                          slight decrease in search speed.

-wordLength    -wl        Size in base pairs of the words used to form
                          the hash table. May vary from 1 to 16; the
                          larger values assume that sufficient RAM is
                          available.
                          Default value is 10.

-maxGap        -mg        Maximum gap allowed between successive hits for
                          them to count as part of the same match.
                          Default value is 0.

-maxInsert     -mi        Maximum number of insertions/deletions allowed
                          between successive hits for them to count as part
                          of the same match.
                          Default value is 0.

-maxStore      -ms        Largest number of times that a word may occur in
                          the hash table for it to be used for matching
                          expressed as a multiple of the number of
                          occurrences per word that would be expected
                          for a random database of the same size as the
                          subject database.
                          Default value is 10000.
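
                          Example:
                          One possible reading of this cutoff, sketched in
                          Python. It is an assumption, not taken from the
                          SSAHA source, that the expected occurrences per
                          word equal the number of words stored (database
                          size divided by step length) divided by the
                          number of possible words (4 to the power of the
                          word length for DNA). max_store_cutoff is a
                          hypothetical name.

```python
def max_store_cutoff(database_size, word_length=10, step_length=None,
                     max_store=10000, alphabet_size=4):
    """Estimate the occurrence count above which a word would be skipped."""
    if step_length is None:
        step_length = word_length  # documented default for -stepLength
    # Assumption: one word is stored per step along the database.
    words_stored = database_size // step_length
    # Assumption: a random database spreads words evenly over all possibilities.
    expected_per_word = words_stored / alphabet_size ** word_length
    return max_store * expected_per_word


# A database sized so that each possible 10-mer is expected exactly once.
cutoff = max_store_cutoff(10 * 4 ** 10)
```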

-numRepeats    -nr        Maximum size of tandem repeating motif that can be
                          detected in the query sequence. This option may
                          produce faster and better matches when dealing
                          with data containing tandem repeats.
                          Defaults to 0, and must be less than or equal to
                          the word length.
                          Notes:
                          1. This option does nothing if -ph is also set.
                          2. To get the best results with this option, set
                          -mg to be at least equal to the word length.
                          Setting the -mi option may also help.

-minPrint      -mp        The minimum number of matching bases or residues
                          that must be found in the query and subject
                          sequences before they are considered as a match
                          and thus printed.
                          Default value is 1.

-queryStart    -qs        Specifies the number of the first query sequence to
                          be matched with the subject sequences (numbering of
                          both the query and subject sequences starts at 1).
                          Default value is 1.

-queryEnd      -qe        Specifies the number of the last query sequence to
                          be matched with the subject sequences. If not
                          specified, continues until the end of the query
                          sequence data is reached.

-reportMode    -rm        Specifies behaviour upon encountering unexpected
                          alphanumeric characters in query or subject 
                          sequences:

                          ignore - do nothing
                          report - report to standard error
                          replaceA   - silently replace character with 'A'
                          replaceG, replaceC, replaceT - as for replaceA
                          rrepA   - replace character with 'A' and report
                          rrepG, rrepC, rrepT - as for rrepA
                          Default value is `ignore.'

                          NB: AS OF VERSION 3.0, THE -reportMode OPTION HAS
                          BEEN SUPERSEDED BY THE -queryReplace AND
                          -subjectReplace OPTIONS. SEE THE APPROPRIATE HELP
                          ENTRIES.

-reverseQuery  -rq        When matching the reverse strand of a query,
                          convert the positions of any matches found
                          into the coordinate frame of the forward strand.
                          Has no effect if queryType is set to protein.
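
                          Example:
                          A sketch in Python of one common convention for
                          this conversion, assuming 1-based inclusive
                          coordinates. Whether SSAHA uses exactly this
                          convention is not stated here, and the name
                          reverse_to_forward is hypothetical.

```python
def reverse_to_forward(start, end, query_length):
    """Map a 1-based inclusive reverse-strand interval to the forward strand."""
    return query_length - end + 1, query_length - start + 1


# A match at bases 22 to 121 of the reverse strand of a 653-base query.
forward_start, forward_end = reverse_to_forward(22, 121, 653)
```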

-saveName      -sn        Specifies that the hash table must be saved before
                          the program exits. This option must be followed by
                          a string fileNameRoot. The hash table data is
                          saved into the files
                            fileNameRoot.head
                            fileNameRoot.body
                            fileNameRoot.name
                            fileNameRoot.size

                          Notes:
                          1. If no query file is specified (the third usage
                          form in SYNOPSIS above) it is an error not to set
                          this option.
                          2. It is an error to set this option if
                          subjectFormat is set to `hash.'
                          3. If the -ph option is also set, the -sn option
                          also produces a fileNameRoot.start file.

-sortMatches   -sm        Output only the top n matches for each query,
                          sorted by number of matching bases, then by
                          subject name, then by start position in the
                          query sequence.
                          Default value is zero, which outputs all matches
                          for each query and does no sorting.

-stepLength    -sl        Number of base pairs between the start positions
                          of successive words used to produce the hash
                          table. Ignored if a precomputed hash table is
                          being used. Default value is equal to wordLength.

-queryReplace  -qr        Specifies behaviour upon encountering unexpected
                          alphanumeric characters in query sequences:

                          ignore - do nothing
                          report - report to standard error
                          A,G, etc. - replace with that character:
                          must be A, G, C, T for DNA, or a valid IUPAC
                          amino acid code for protein.
                          Default: replace with 'A' for DNA, 'X' for protein

-subjectReplace -sr       Specifies behaviour upon encountering unexpected
                          alphanumeric characters in subject sequences:

                          ignore - do nothing
                          report - report to standard error
                          A,G, etc. - replace with that character
                          Must be A, G, C, T for DNA or a valid IUPAC
                          amino acid code for protein
                          tag - `tag' the word so that it is not put
                          into the hash table.
                          Default: tag

-substituteWords -sw      Look for single base/amino acid mismatches in
                          words whose number of occurrences is less than
                          this multiple of the number that would be expected
                          for a random database of the same size as the
                          subject database.

                          Only looks for:
                          purine (G-A)/pyrimidine (T-C) mismatches for DNA
                          mismatches with positive BLOSUM score for protein
                          Set to zero to switch this feature off.
                          Default value: 0 (switched off)

-doAlignment   -da        Produce a graphical alignment of the matching region
                          using banded dynamic programming. The alignment
                          will be formatted to the specified number of columns.
                          Set to zero to suppress alignments, otherwise
                          must be at least 20.
                          Default value: 80

-bandExtension -be        Specify size of the band to use for banded dynamic
                          programming, when producing a graphical alignment.
                          0 - diagonal only
                          n - n cells each side of diagonal
                          Only has an effect when -da is nonzero
                          Default value: 0 (diagonal only)

OUTPUT FORMAT

When full alignments are requested (-da set to nonzero) the software produces
a line of information then a graphical alignment for each match found:

i) DNA against DNA (untranslated)

RF      p1_1a788a06.q1c 515     538     p1_1a788f11.p1c 493     516     24      100.00
Alignment score: 13
Q:000000515 tttt-tgagacggagtctcgctct
            ||||x||||||xx|||||||||||
S:000000493 ttttttgagacaaagtctcgctct

ii) Protein against protein

FF      SW:PPSA_AERPE   625     642     SW:PPSA_METTH   577     592     8       44.44
Alignment score: 31
Q:000000625 KGGEKYETLDERNPMIGW
            x|||x |xx |x|||x||
S:000000577 EGGEN-EPY-EHNPMLGW

iii) Protein query against translated DNA subject 
(format for translated DNA query against protein subject is similar)

FR      SW:PPS2_HUMAN   467     473     p1_1a788c10.q1c 507     529     7       100.00
Alignment score: 22
Q:000000467 E..E..G..--V..L..D..P..
            |||||||||  ||||||Nxx|||
S:000000507 gaggagggcatgtattaaaccca

iv) DNA against DNA (translated)

FF      p1_1a788a01.p1c 52      76      p1_1a788b11.p1c 314     336     24      96.00
Alignment score: 3
Q:000000052 atggtatgtctttcttttact-agat-
            ||xV..C..L..S..F..    R..  
S:000000314 attgtttgtctctccttc---taga-a

 From left to right the fields in the match information line are as follows:

 i)    First character: query direction (F forward, R reverse)
      Second character: subject direction (F forward, R reverse)
 ii)   query name
 iii)  query start
 iv)   query end
 v)    subject name
 vi)   subject start
 vii)  subject end
 viii) estimated number of matching bases
 ix)   estimated percentage identity

 The last two quantities are not exact values, they are approximations used
 to order the matches (if requested) before the full alignment is done.

 With the alignments switched off (-da 0), an entry like the one below is
 produced for each sequence in the query.

 Matches For Query 6 (653 bases): p1_1a788a03.q1c

 F 6 : p1_1a788a03.q1c  Bases: 650   Q: 1 to 650    S: 1 to 650     100.00%
 R 5 : p1_1a788a03.p1c  Bases: 100   Q: 22 to 121   S: 501 to 600   100.00%

 The top line shows the query number, name and size of the query sequence.
 Below that is one line for each match found in the subject database. From
 left to right, the entries on these lines are as follows:

 match direction: F forward, R reverse
 subject number
 subject name
 number of matching bases
 query start
 query end
 subject start
 subject end
 percentage identity
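
 As a hedged illustration, the Python sketch below reads one such match line
 with a regular expression. The pattern is inferred from the sample entry
 above and assumes that sequence names contain no whitespace; the names
 MATCH_LINE and parse_default_line are hypothetical.

```python
import re

# Pattern inferred from the sample entry; real output may differ in spacing.
MATCH_LINE = re.compile(
    r"^\s*(?P<direction>[FR])\s+(?P<subject_number>\d+)\s*:\s*"
    r"(?P<subject_name>\S+)\s+Bases:\s*(?P<bases>\d+)\s+"
    r"Q:\s*(?P<q_start>\d+)\s+to\s+(?P<q_end>\d+)\s+"
    r"S:\s*(?P<s_start>\d+)\s+to\s+(?P<s_end>\d+)\s+"
    r"(?P<identity>[\d.]+)%\s*$"
)


def parse_default_line(line):
    """Return the fields of one match line as a dict, or None if no match."""
    m = MATCH_LINE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("subject_number", "bases", "q_start", "q_end",
                "s_start", "s_end"):
        fields[key] = int(fields[key])
    fields["identity"] = float(fields["identity"])
    return fields


parsed = parse_default_line(
    " R 5 : p1_1a788a03.p1c  Bases: 100   Q: 22 to 121   S: 501 to 600   100.00%"
)
```

 Lines that do not have this shape, such as the "Matches For Query" header,
 yield None.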

Notes:

 1. The output format is different if the program is run with the
 -parserFriendly option set. See the description of that option for details.

 2. Because SSAHA works by looking for whole-word matches, the `number of
 matching bases' and `percentage identity' fields must be considered as lower
 bounds on the true values of these quantities.

FURTHER INFORMATION

 The SSAHA home page is at http://www.sanger.ac.uk/Software/analysis/SSAHA/

 Zemin Ning, Anthony J. Cox and James C. Mullikin. SSAHA: A Fast Search
 Method for Large DNA Databases. Submitted to Genome Research.