To enable efficient editing of data, Gap5 needs its own database
format for storing sequence assemblies. Formats such as BAM are good
at random access for read-only viewing, but are not at all amenable to
actions such as reverse complementing a contig and joining it to
another.
Hence we need a tool that can take existing assembly formats and
convert them to a form suitable for Gap5. The tg_index program
performs this task. It is strictly a command line tool, although in
some specific cases Gap5 has basic GUI dialogues to wrap it up.
One or more input files may be specified. The general form is:
tg_index [options] -o gap5_db_name input_file_name ...
An example usage is:
tg_index -z 16384 -o test_data.g5 test_data.bam
gap5 test_data.g5 &
File formats supported are SAM, BAM, ACE, MAQ (both short and long
variants), CAF, BAF, Fasta and Fastq. The latter two have no assembly
or alignment information, so they are simply loaded as single-read
contigs instead. Tg_index usually detects the file type automatically,
but in rare cases you may need to state the input file type
explicitly.
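For scripted conversions, the general form above can be assembled programmatically. The following is a minimal Python sketch and not part of Gap5: the helper name build_tg_index_command and the FORCE_FLAGS table are hypothetical, and it uses only the -o, -z and format-forcing flags described in this section. It assumes tg_index is on the PATH when the command is actually run.

```python
import subprocess

# Flags that force a specific input format, as listed in the options of
# this section. Fasta and Fastq rely on automatic detection in this sketch.
FORCE_FLAGS = {
    "maq-short": "-m",
    "maq-long": "-M",
    "ace": "-A",
    "baf": "-B",
    "caf": "-C",
    "bam": "-b",
    "sam": "-s",
}


def build_tg_index_command(db_name, inputs, bin_size=None, force_format=None):
    """Return an argv list of the form: tg_index [options] -o db input ..."""
    if not inputs:
        raise ValueError("at least one input file is required")
    cmd = ["tg_index"]
    if bin_size is not None:
        cmd += ["-z", str(bin_size)]
    if force_format is not None:
        # Raises KeyError for a format name this sketch does not know.
        cmd.append(FORCE_FLAGS[force_format])
    cmd += ["-o", db_name]
    cmd += list(inputs)
    return cmd


if __name__ == "__main__":
    cmd = build_tg_index_command("test_data.g5", ["test_data.bam"], bin_size=16384)
    print(" ".join(cmd))
    # Uncomment to run the conversion (requires tg_index to be installed):
    # subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when database or input file names contain spaces.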
Tg_index options:
- -o filename
-
Creates a gap5 database named filename and filename.aux.
If not specified, the default is "g_db".
- -a
-
Append to an existing database, instead of creating a new one (which
is the default action).
- -n
-
When appending, the default behaviour is to add reads to existing
contigs if contigs with the appropriate names already exist. This
option always forces creation of new contigs instead.
- -g
-
When appending to an existing database, assume that the alignment has
been performed against an ungapped copy of the consensus exported from
this database. (This is internally used when performing mapped
assemblies as they consist of exporting the consensus, running the
external mapped alignment tool, and then importing the newly generated
alignments.)
- -m
-
- -M
-
Forces the input to be treated as MAQ. Both short (-m) and long (-M)
formats are supported. By default the file format is automatically
detected.
- -A
-
Forces the input to be treated as ACE format.
- -B
-
Forces the input to be treated as BAF format.
- -C
-
Forces the input to be treated as CAF format.
- -b
-
- -s
-
Forces the input to be treated as BAM (-b) or SAM (-s) format. SAM must
have @SQ headers present. Both need to be sorted by position.
- -z bin_size
-
Modifies the size of the smallest allowable contig bin. Large contigs
will contain child bins, each of which will contain smaller bins,
recursing down to a minimum bin size. Sequences are then placed in the
smallest bin they entirely fit within. The default minimum bin size is
4096 bytes. For very shallow assemblies, increasing this will improve
performance and decrease the disk space used. As a rough guide, aim
for 5,000 to 10,000 sequences per bin.
- -u
-
Store unmapped reads only (from SAM/BAM only).
- -x
-
Store SAM/BAM auxiliary key:value records too.
- -p
-
- -P
-
Enable (-p) or disable (-P) read-pairing. By default this is
enabled. The purpose of this is to link sequences from the same
template to each other such that gap5 knows the insert size and
read-pairings. Generally this is desirable, but it adds extra time and
memory to identify the pairs. Hence for single-ended runs the option
exists to disable attempts at read-pairing.
- -f
-
Attempt a faster form of read-pairing. In this mode we link the second
occurrence of a template to the first occurrence, but not vice
versa. This is sufficient for the template display graphical views to
work, but will cause other parts of the program to behave
inconsistently. For example the contig editor "goto..." popup menu
will sometimes be missing.
- -t
-
- -T
-
Controls whether to index (-t) or not (-T) the sequence names. By
default this is disabled. Adding a sequence name index permits us to
search by sequence name or to use a sequence name in any dialogue that
requires a contig identifier. However, it consumes more disk space to
store this index and it can be time consuming to construct it.
- -r nseq
-
Reserves space for at least nseq sequences. This generally isn't
necessary, but if the total number of records extends above 2 million
(equivalent to 2 billion sequences, or less if we have lots of
contigs, bins and annotation records to write) then we run out of
suitable sequence record numbers. This option preallocates the lower
record numbers and reserves them solely for sequence records.
- -c compression_method
-
Specifies an alternative compression method. This defaults to zlib,
but can be set to either none for the fastest speed or lzma for the
best compression.
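The effect of the -z option described above can be illustrated with a small model of bin placement. This is a hedged Python sketch, not Gap5's actual implementation: it assumes each bin splits into two equal child bins and that the root bin is the minimum bin size doubled until it covers the contig. The function name smallest_enclosing_bin is hypothetical. It returns the start and size of the smallest bin that a sequence entirely fits within.

```python
def smallest_enclosing_bin(contig_len, start, end, min_bin_size=4096):
    """Return (bin_start, bin_size) for the half-open range [start, end).

    Assumption of this sketch: every bin has two equal-sized children,
    and recursion stops once a bin reaches min_bin_size.
    """
    if not (0 <= start < end <= contig_len):
        raise ValueError("sequence must lie within the contig")

    # Root bin: double the minimum size until it covers the whole contig.
    size = min_bin_size
    while size < contig_len:
        size *= 2

    bin_start = 0
    while size > min_bin_size:
        half = size // 2
        mid = bin_start + half
        if end <= mid:
            # Fits entirely within the left child.
            size = half
        elif start >= mid:
            # Fits entirely within the right child.
            bin_start = mid
            size = half
        else:
            # Spans the child boundary, so it stays in the current bin.
            break
    return bin_start, size


if __name__ == "__main__":
    # A short read far from any bin boundary lands in a minimum-sized bin.
    print(smallest_enclosing_bin(100000, 100, 200))
    # A read crossing position 4096 must stay in a larger parent bin.
    print(smallest_enclosing_bin(100000, 4000, 4200))
    # With a larger minimum bin size the same read sits in a leaf bin.
    print(smallest_enclosing_bin(100000, 4000, 4200, min_bin_size=16384))
```

Under this model, a larger minimum bin size yields fewer, fuller bins, which is why it suits shallow assemblies where small bins would each hold only a handful of sequences.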
Last generated on 25 November 2011.