/*
Copyright (c) 2014 Genome Research Ltd.
Author: Jonathan Hinton jwh@sanger.ac.uk
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU Affero General Public License as published by the Free
Software Foundation; either version 3 of the License, or (at your option) any
later version.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
This utility is intended to streamline the process of submitting sample-related data files to the Cancer Genome Project (CGP), either online
(e.g. via FTP) or physically on a hard disk drive.
At its most basic level the utility performs the following tasks:
    Associate sample names with data files
    Copy the sample files to a given destination (e.g. a hard disk drive)
    Compress and bundle (in this case tar) the sample files
    Encrypt all files transferred
Once complete, the data is ready to submit to the CGP.
######################
Installation #########
######################
The application is written in the Java programming language and is distributed as a single executable .jar file (dataSubmission-##.jar). To run it
you must have a Java runtime environment, version 1.6 or later, installed on your system. If you do not have one, please consult your IT department
or download the latest version of Java from https://www.java.com/.
To install, uncompress the dataSubmission.zip file to a location of your choice. That's it!
To run the utility, open a command line terminal, navigate to the folder containing the application file (..\dataSubmission\dataSubmission-##.jar) and type
the following:
java -jar dataSubmission-##.jar -h
This will display all of the utility options available to you.
######################
Getting Started ######
######################
Using the exact sample names you supplied to the CGP in the Sample Manifest, you will now need to create a Driver Manifest. The Driver Manifest
is a simple text file that helps the utility to process your submission. It can be created in whatever program you like as long
as it is saved as a tab-delimited .txt file.
The structure of this file should be as follows:
SAMPLE_NAME<tab>LIBRARY<tab>RUN<tab>LANE<tab>FILE_TYPE<tab>FILE_PATH
where <tab> is the tab character. Lines beginning with the '#' character are ignored, so if you wish to include a header line to make the file
easier to read, it must begin with #. Each file must be represented by its own line, even when several files belong to the same sample. To make life easier
the utility will also accept space-separated fields. For this reason no field entry may contain spaces, otherwise the file will not be parsed correctly.
An example:
#SAMPLE_NAME LIBRARY RUN LANE FILE_TYPE FILE_PATH
my_sample lib_1 12 5 bam C:\sample_files\my_sample\my_sample.bam
my_sample lib_1 12 5 fastq1 C:\sample_files\my_sample\my_sample.fastq1
my_sample lib_1 12 5 fastq2.gz C:\sample_files\my_sample\my_sample.fastq2.gz
my_sample2 lib_1 12 5 bam C:\sample_files\my_sample2\my_sample.bam
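The parsing rules above can be sketched in Java as follows. This is a minimal illustration and not the utility's actual code: the class name DriverManifestSketch, its Entry type and the decision to skip blank lines are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of the utility): parses Driver Manifest lines. */
public class DriverManifestSketch {

    /** One manifest line; the field names mirror the column headings. */
    public static final class Entry {
        public final String sampleName, library, run, lane, fileType, filePath;

        Entry(String[] f) {
            sampleName = f[0];
            library = f[1];
            run = f[2];
            lane = f[3];
            fileType = f[4];
            filePath = f[5];
        }
    }

    /** Parses the given lines, skipping '#' comment lines (and, by assumption, blank lines). */
    public static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<Entry>();
        int lineNo = 0;
        for (String raw : lines) {
            lineNo++;
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // Tabs and spaces both count as separators, which is why
            // no field (including the file path) may contain a space.
            String[] fields = line.split("\\s+");
            if (fields.length != 6) {
                throw new IllegalArgumentException("Line " + lineNo
                        + ": expected 6 fields but found " + fields.length);
            }
            entries.add(new Entry(fields));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "#SAMPLE_NAME LIBRARY RUN LANE FILE_TYPE FILE_PATH",
                "my_sample lib_1 12 5 bam C:\\sample_files\\my_sample\\my_sample.bam",
                "my_sample2\tlib_1\t12\t5\tbam\tC:\\sample_files\\my_sample2\\my_sample.bam");
        for (Entry e : parse(lines)) {
            System.out.println(e.sampleName + " (" + e.fileType + "): " + e.filePath);
        }
    }
}
```

A line such as "my_sample lib_1 12 5 bam C:\my files\a.bam" would be rejected by this sketch because the space in the path splits it into seven fields.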
List of acceptable file types:
bam -> bam
fastq -> fastq, fq
fastq1 -> fastq1, fq1
fastq2 -> fastq2, fq2
fastq.gz -> fastq.gz, fq.gz
fastq1.gz -> fastq1.gz, fq1.gz
fastq2.gz -> fastq2.gz, fq2.gz
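As an illustration, the table above can be read as a lookup from each accepted FILE_TYPE spelling to the type named on the left of the arrow. The sketch below assumes that reading and an exact, case-sensitive match. The class FileTypeSketch and its method are hypothetical and may not reflect how the utility itself handles the field.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of the utility): maps FILE_TYPE spellings to one name. */
public class FileTypeSketch {

    private static final Map<String, String> CANONICAL = new HashMap<String, String>();

    static {
        // Each accepted spelling points at the type on the left of the arrow.
        CANONICAL.put("bam", "bam");
        CANONICAL.put("fastq", "fastq");
        CANONICAL.put("fq", "fastq");
        CANONICAL.put("fastq1", "fastq1");
        CANONICAL.put("fq1", "fastq1");
        CANONICAL.put("fastq2", "fastq2");
        CANONICAL.put("fq2", "fastq2");
        CANONICAL.put("fastq.gz", "fastq.gz");
        CANONICAL.put("fq.gz", "fastq.gz");
        CANONICAL.put("fastq1.gz", "fastq1.gz");
        CANONICAL.put("fq1.gz", "fastq1.gz");
        CANONICAL.put("fastq2.gz", "fastq2.gz");
        CANONICAL.put("fq2.gz", "fastq2.gz");
    }

    /** Returns the canonical type for a FILE_TYPE value, or throws if it is not listed. */
    public static String canonical(String fileType) {
        String result = CANONICAL.get(fileType);
        if (result == null) {
            throw new IllegalArgumentException("Unsupported file type: " + fileType);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(canonical("fq1.gz"));
        System.out.println(canonical("bam"));
    }
}
```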
The utility uses these file paths only to locate the sample files for copying, so no information about your underlying file system is recorded.
WARNING: The Driver Manifest file is only used to direct the utility and MUST NOT be included with the submission.
The utility requires three inputs:
(-s) The CGP Submission ID
(-d) The Driver Manifest file path
(-o) The destination directory
The CGP Submission ID will be supplied to you by your CGP contact. If you do not have one, please obtain one before proceeding. If the ID is
incorrect we will not be able to process your submission.
The destination directory is where the compressed, encrypted data will be written. If the original data to be submitted is hundreds of gigabytes
in size then the output will likely be a similar size. Please ensure the location you designate as your destination has sufficient
available storage. If you need to move the compressed data from one file location to another it is highly recommended that you verify
the files using the MD5 checksum file generated by the utility.
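If no checksum tool is to hand, an MD5 digest can be recomputed with a few lines of Java and compared with the value in the checksum file. The sketch below is only an illustration: the class Md5Sketch is hypothetical, and because the layout of the checksum file is not described here, the expected value is passed in by hand as a hexadecimal string.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not part of the utility): recomputes an MD5 digest. */
public class Md5Sketch {

    /** Reads the stream to its end and returns the MD5 digest as lower-case hex. */
    public static String md5Hex(InputStream in) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a standard algorithm on the Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Md5Sketch <file> <expected-md5-hex>");
            return;
        }
        InputStream in = new FileInputStream(args[0]);
        try {
            String actual = md5Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1].trim())
                    ? "OK: checksums match"
                    : "MISMATCH: computed " + actual);
        } finally {
            in.close();
        }
    }
}
```

The file is read in 8 KB chunks, so even very large archives can be checked without loading them into memory.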
If you intend to submit your data via hard disk drive it is recommended that either you use a new device or perform a low-level format of an
existing device prior to transferring any data. It is acceptable to point the utility directly to the device to save moving files around.
It is possible to ask the utility to perform a "dry run" using the -y parameter. This will perform all the validation steps and will estimate how
much space is likely to be needed by the resulting file.
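For readers who want a rough check of their own, the sketch below adds up the sizes of the input files and compares the total with the free space at the destination. It rests on the statement above that the output is likely to be a similar size to the input. It is not how the utility computes its estimate, and the class SpaceCheckSketch and its methods are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the utility): rough free-space check. */
public class SpaceCheckSketch {

    /** Returns the combined size in bytes of the given files. */
    public static long totalBytes(List<File> files) {
        long total = 0;
        for (File f : files) {
            if (!f.isFile()) {
                throw new IllegalArgumentException("Not a file: " + f);
            }
            total += f.length();
        }
        return total;
    }

    /** True if the destination reports at least requiredBytes of usable space. */
    public static boolean fits(long requiredBytes, File destination) {
        return destination.getUsableSpace() >= requiredBytes;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: java SpaceCheckSketch <destination> <file>...");
            return;
        }
        File destination = new File(args[0]);
        List<File> files = new ArrayList<File>();
        for (int i = 1; i < args.length; i++) {
            files.add(new File(args[i]));
        }
        long required = totalBytes(files);
        System.out.printf("About %.2fMB needed; destination %s%n",
                required / (1024.0 * 1024.0),
                fits(required, destination) ? "has room" : "is too small");
    }
}
```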
An example dry-run command and its output might look like:
java -jar dataSubmission-##.jar -y -s 12345 -d C:\secure_manifest_files\CGP\submission_12345\driver_manifest.txt -o F:\
>###
>Performing a dry run. Nothing will be written to file
>###
>###
>Estimating 0.32MB of space is required for 2 files
>###
Running the same command without -y performs the actual packing:
java -jar dataSubmission-##.jar -s 12345 -d C:\secure_manifest_files\CGP\submission_12345\driver_manifest.txt -o F:\
>###
>Estimating 0.32MB of space is required for 2 files
>###
>Adding archive entry: data/my_sample/my_lib/123/2/test.bam
>Adding archive entry: data/my_sample/my_lib/123/2/test2.bam
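Judging only from the sample output above, each archive entry appears to be named data/SAMPLE_NAME/LIBRARY/RUN/LANE/ followed by the file name. The sketch below reproduces that pattern. The layout is inferred from the log lines rather than confirmed, and the class ArchiveEntrySketch is hypothetical.

```java
/** Hypothetical helper (not part of the utility): builds an archive entry name. */
public class ArchiveEntrySketch {

    /** Returns the last path component, accepting both '\' and '/' separators. */
    static String baseName(String path) {
        int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(cut + 1);
    }

    /** Joins the manifest fields into the entry name pattern seen in the log output. */
    public static String entryName(String sample, String library, String run,
                                   String lane, String filePath) {
        return "data/" + sample + "/" + library + "/" + run + "/" + lane
                + "/" + baseName(filePath);
    }

    public static void main(String[] args) {
        // Prints: data/my_sample/my_lib/123/2/test.bam
        System.out.println(entryName("my_sample", "my_lib", "123", "2",
                "C:\\sample_files\\test.bam"));
    }
}
```

Only the file name survives from FILE_PATH in this pattern, which is consistent with the statement that no information about your underlying file system is recorded.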
The utility will output the encrypted data to a single file called data_###.gpg, where ### is the CGP Submission ID supplied with the Sample
Manifest file.
WARNING: The utility does not contain the decryption key and therefore cannot open or modify the data_###.gpg file once it has been written. If
you want to add more data to the submission, create a new Driver Manifest and repeat the steps above to produce a second file, data_###.#.gpg.
The utility has to read, check, compress, encrypt and write your data in one go. The time it takes to complete will vary according to the amount of
data you are trying to submit and the speed of your systems. If the total size of your data files is measured in hundreds of gigabytes the
utility will take a long time to complete. As such, it is recommended that you start this process on a stable computer that is unlikely to be
switched off during the packing process.
Once the encrypted file has been created, please do NOT change its name, as this may cause our automated systems to ignore your submission.