/*
Copyright (c) 2014 Genome Research Ltd.

Author: Jonathan Hinton jwh@sanger.ac.uk

This program is free software: you can redistribute it and/or modify it
under the terms of the GNU Affero General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License
for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see .
*/

This utility is intended to streamline the process of submitting
sample-related data files to the Cancer Genome Project (CGP), either online
(e.g. ftp) or physically on a hard disk drive.

At its most basic level the utility performs the following tasks:

  Associate sample names with data files
  Copy the sample files to a given destination (e.g. a hard disk drive)
  Compress and bundle (in this case tar) the sample files
  Encrypt all files transferred

Once complete, the data is ready to submit to the CGP.

######################
Installation #########
######################

The application is written in the Java programming language and is
distributed as a single executable .jar file (dataSubmission-##.jar). To run
it you must have at least the Java 1.6 runtime environment installed on your
system. If you do not, either consult your IT department or download the
latest version of Java from https://www.java.com/.

To install, uncompress the dataSubmission.zip file to a location of your
choice. That's it!

To run the utility, open a command line terminal, locate the following
application file:

..\dataSubmission\dataSubmission-##.jar

and type the following:

java -jar dataSubmission-##.jar -h

This will display all of the utility options available to you.

######################
Getting Started ######
######################

Using the exact sample names you supplied to the CGP in the Sample Manifest,
you will now need to create a Driver Manifest. The Driver Manifest is a
simple text file that tells the utility how to process your submission. It
can be created in whatever program you like as long as it is saved as a
tab-delimited .txt file. The structure of this file should be as follows:

SAMPLE_NAME<tab>LIBRARY<tab>RUN<tab>LANE<tab>FILE_TYPE<tab>FILE_PATH

where <tab> is the tab character.

Lines beginning with the '#' character are ignored, so if you wish to
include a header line to make the file easier to read, it must begin with #.
Each file must be represented by its own line, even when several files
belong to the same sample. For convenience the utility also accepts
space-separated fields. For this reason no field entry may contain spaces,
otherwise the file will not be parsed correctly.

An example:

#SAMPLE_NAME LIBRARY RUN LANE FILE_TYPE FILE_PATH
my_sample lib_1 12 5 bam C:\sample_files\my_sample\my_sample.bam
my_sample lib_1 12 5 fastq1 C:\sample_files\my_sample\my_sample.fastq1
my_sample lib_1 12 5 fastq2.gz C:\sample_files\my_sample\my_sample.fastq2.gz
my_sample2 lib_1 12 5 bam C:\sample_files\my_sample2\my_sample.bam

List of acceptable file types:

  bam       -> bam
  fastq     -> fastq, fq
  fastq1    -> fastq1, fq1
  fastq2    -> fastq2, fq2
  fastq.gz  -> fastq.gz, fq.gz
  fastq1.gz -> fastq1.gz, fq1.gz
  fastq2.gz -> fastq2.gz, fq2.gz

The utility uses these file paths only to locate the sample files for
copying, so no information about your underlying file system is recorded.

WARNING: The Driver Manifest file is only used to direct the utility and
MUST NOT be included with the submission.
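
Before running the utility it can be useful to check a Driver Manifest
against the rules above. The following is a minimal Python sketch of such a
check, not the utility's own parser. The names check_driver_manifest,
FILE_TYPES and FIELDS are hypothetical. It assumes that blank lines may be
skipped, that file types are matched exactly as written, and that the list
above maps each file type (left) to its accepted spellings (right).

```python
# Hypothetical pre-flight check for a Driver Manifest; not part of the utility.

# Accepted FILE_TYPE spellings mapped to the file type they stand for
# (assumed reading of the "List of acceptable file types" above).
FILE_TYPES = {
    "bam": "bam",
    "fastq": "fastq", "fq": "fastq",
    "fastq1": "fastq1", "fq1": "fastq1",
    "fastq2": "fastq2", "fq2": "fastq2",
    "fastq.gz": "fastq.gz", "fq.gz": "fastq.gz",
    "fastq1.gz": "fastq1.gz", "fq1.gz": "fastq1.gz",
    "fastq2.gz": "fastq2.gz", "fq2.gz": "fastq2.gz",
}

# Column order described in the Getting Started section.
FIELDS = ("sample_name", "library", "run", "lane", "file_type", "file_path")


def check_driver_manifest(text):
    """Return (entries, problems) for the text of a Driver Manifest."""
    entries, problems = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        # Lines beginning with '#' are ignored; skipping blank lines is an assumption.
        if not line.strip() or line.startswith("#"):
            continue
        # Tabs and spaces both separate fields, so entries cannot contain spaces.
        values = line.split()
        if len(values) != len(FIELDS):
            problems.append(
                f"line {number}: expected {len(FIELDS)} fields, found {len(values)}"
            )
            continue
        entry = dict(zip(FIELDS, values))
        file_type = FILE_TYPES.get(entry["file_type"])
        if file_type is None:
            problems.append(
                f"line {number}: unknown file type '{entry['file_type']}'"
            )
            continue
        entry["file_type"] = file_type
        entries.append(entry)
    return entries, problems
```

Reading the manifest file into a string and passing it to
check_driver_manifest returns the parsed entries together with a list of
problem messages. An empty problem list means every data line has six fields
and a listed file type. The sketch does not check that the files exist.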

The utility requires three inputs:

  (-s) The CGP Submission ID
  (-d) The Driver Manifest file path
  (-o) The destination directory

The CGP Submission ID will be supplied to you by your CGP contact. If you do
not have one, please acquire one before proceeding. If the ID is incorrect
we will not be able to process your submission.

The destination directory is where the compressed, encrypted data will be
written. If the original data to be submitted is hundreds of gigabytes in
size then the output will likely be a similar size. Please ensure the
location you designate has sufficient available storage. If you need to move
the compressed data from one file location to another, it is highly
recommended that you verify the files using the md5 checksum file generated
by the utility.

If you intend to submit your data on a hard disk drive, it is recommended
that you either use a new device or perform a low-level format of an
existing device before transferring any data. It is acceptable to point the
utility directly at the device to save moving files around.

You can ask the utility to perform a "dry run" using the -y parameter. This
performs all the validation steps and estimates how much space the resulting
file is likely to need.

An example command might look like:

java -jar dataSubmission.jar -y -d C:\secure_manifest_files\CGP\submission_12345\driver_manifest.xlsx -o F:\
>###
>Performing a dry run. Nothing will be written to file
>###
>###
>Estimating 0.32MB of space is required for 2 files
>###

java -jar dataSubmission.jar -d C:\secure_manifest_files\CGP\submission_12345\driver_manifest.xlsx -o F:\
>###
>Estimating 0.32MB of space is required for 2 files
>###
>Adding archive entry: data/my_sample/my_lib/123/2/test.bam
>Adding archive entry: data/my_sample/my_lib/123/2/test2.bam

The utility will output the encrypted data to a single file called
data_###.gpg, where ### is the CGP submission ID supplied with the Sample
Manifest file.

WARNING: The utility does not contain the decryption key and therefore
cannot open or modify the data_###.gpg file once it has been written. If you
want to add more data to the submission, create a new driver file and repeat
the steps above to create a second data_###.#.gpg file.

The utility has to read, check, compress, encrypt and write your data in one
go. The time it takes to complete will vary according to the amount of data
you are submitting and the speed of your systems. If the total size of your
data files is measured in hundreds of gigabytes, the utility will take a long
time to complete. It is therefore recommended that you start this process on
a stable computer that is unlikely to be switched off during the packing
process.

Once the encrypted file has been created, please do NOT change its name, as
this may cause our automated systems to ignore your submission.
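
The md5 verification recommended above for files that have been moved could
be done with any md5 tool. The following is a minimal Python sketch using
only the standard library. The layout of the checksum file written by the
utility is not described here, so the sketch assumes you copy the expected
hexadecimal digest out of that file yourself. The function names md5_of_file
and matches_md5 are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the md5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's md5 digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Calling matches_md5 with the path of the moved data_###.gpg file and the
digest taken from the checksum file returns True when the copy is intact.
Reading in chunks matters here because the encrypted file may be hundreds of
gigabytes in size.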