
Importing with tg_index

To enable efficient editing of data, Gap5 needs its own database format for storing sequence assemblies. Formats such as BAM are good at random access for read-only viewing, but are not at all amenable to actions such as reverse complementing a contig and joining it to another.

Hence we need a tool that can take existing assembly formats and convert them to a form suitable for Gap5. The tg_index program performs this task. It is strictly a command line tool, although in some specific cases Gap5 has basic GUI dialogues to wrap it up.

One or more input files may be specified. The general form is:

tg_index [options] -o gap5_db_name input_file_name ...

An example usage is:

    tg_index -z 16384 -o test_data.g5 test_data.bam
    gap5 test_data.g5 &

The supported file formats are SAM, BAM, ACE, MAQ (both short and long variants), CAF, BAF, Fasta and Fastq. The last two carry no assembly or alignment information, so each read is simply loaded as its own single-read contig. Tg_index usually detects the file type automatically, but in rare cases you may need to state the input file type explicitly.
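As a minimal sketch of the general form above, the following builds two command lines: one loading several input files into a single database, and one forcing SAM format (the -s option described below) for a file whose type might not be detected. The file names and the run helper are hypothetical and only for illustration. The helper prints each command rather than executing it, so the sketch works without tg_index installed.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the command line instead of running it.
# Replace 'echo "$@"' with '"$@"' to execute for real (needs tg_index on PATH).
run() {
    echo "$@"
}

# Several input files loaded into one database; file names are placeholders.
run tg_index -o combined.g5 lane1.bam lane2.bam

# Force SAM format for a file with an unhelpful name. The SAM file must
# have @SQ headers and be sorted by position.
run tg_index -s -o forced.g5 alignments.txt
```

Printing the commands first is a cheap way to check the argument order (options, then -o and the database name, then the input files) before committing to a long import.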

Tg_index options:

-o filename
Creates a gap5 database consisting of two files, filename and filename.aux. If not specified, the default is "g_db".
-a
Append to an existing database, instead of creating a new one (which is the default action).
-n
When appending, the default behaviour is to add reads to existing contigs if contigs with the appropriate names already exist. This option always forces creation of new contigs instead.
-g
When appending to an existing database, assume that the alignment has been performed against an ungapped copy of the consensus exported from this database. (This is internally used when performing mapped assemblies as they consist of exporting the consensus, running the external mapped alignment tool, and then importing the newly generated alignments.)
-m
-M
Forces the input to be treated as MAQ. Both the short (-m) and long (-M) formats are supported. By default the file format is automatically detected.
-A
Forces the input to be treated as ACE format.
-B
Forces the input to be treated as BAF format.
-C
Forces the input to be treated as CAF format.
-b
-s
Forces the input to be treated as BAM (-b) or SAM (-s) format. SAM files must have @SQ headers present. Both formats need to be sorted by position.
-z bin_size
Modifies the size of the smallest allowable contig bin. Large contigs will contain child bins, each of which will contain smaller bins, recursing down to a minimum bin size. Sequences are then placed in the smallest bin they entirely fit within. The default minimum bin size is 4096 bytes. For very shallow assemblies, increasing this will improve performance and decrease the disk space used. As an approximate figure, aim for 5,000 to 10,000 sequences per bin.
-u
Store unmapped reads only (SAM/BAM input only).
-x
Store SAM/BAM auxiliary key:value records too.
-p
-P
Enable (-p) or disable (-P) read-pairing. By default this is enabled. The purpose of this is to link sequences from the same template to each other such that gap5 knows the insert size and read-pairings. Generally this is desirable, but it adds extra time and memory to identify the pairs. Hence for single-ended runs the option exists to disable attempts at read-pairing.
-f
Attempt a faster form of read-pairing. In this mode we link the second occurrence of a template to the first occurrence, but not vice versa. This is sufficient for the template display graphical views to work, but will cause other parts of the program to behave inconsistently. For example the contig editor "goto..." popup menu will sometimes be missing.
-t
-T
Controls whether to index (-t) or not (-T) the sequence names. By default this is disabled. Adding a sequence name index permits us to search by sequence name or to use a sequence name in any dialogue that requires a contig identifier. However, the index consumes more disk space and can be time-consuming to construct.
-r nseq
Reserves space for at least nseq sequences. This generally isn't necessary, but if the total number of records extends above 2 million (equivalent to 2 billion sequences, or less if we have lots of contigs, bins and annotation records to write) then we run out of suitable sequence record numbers. This option preallocates the lower record numbers and reserves them solely for sequence records.
-c compression_method
Specifies an alternate compression method. This defaults to zlib, but can be set to either none for fastest speed or lzma for best compression.
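The options above can be combined. The sketch below first gives a rough feel for the -z guideline by estimating sequences per bin as bin_size * depth / read_length. That formula is an approximation introduced here for illustration (it treats the bin size as a span of contig positions and ignores how reads straddle bin boundaries); it is not taken from tg_index itself. The depth, read length and file names are hypothetical, as is the run helper, which prints each command instead of executing it.

```shell
#!/bin/sh
# Rough estimate of sequences per bin (an illustrative approximation,
# not tg_index's own logic): bin_size * depth / read_length.
reads_per_bin() {
    # $1 = bin size, $2 = average depth, $3 = average read length
    echo $(( $1 * $2 / $3 ))
}

# Dry-run helper (hypothetical): prints the command line instead of running it.
run() {
    echo "$@"
}

depth=30       # hypothetical average depth
read_len=100   # hypothetical average read length

# Compare a few candidate bin sizes against the 5,000 to 10,000 guideline.
for bin in 4096 16384 32768; do
    echo "bin size $bin: about $(reads_per_bin "$bin" "$depth" "$read_len") sequences per bin"
done

# Create a database with a larger minimum bin size and lzma compression.
run tg_index -z 32768 -c lzma -o sample.g5 sample.bam

# Later, append further reads to the same database.
run tg_index -a -o sample.g5 more_reads.bam
```

The estimate is only a starting point; real assemblies have uneven depth, so it may be worth trying a couple of bin sizes on a subset of the data and comparing database size and load time.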

Last generated on 25 November 2011.