Created Sun Feb 24 00:01:18 2013
Contact ba1@sanger.ac.uk

More detail about reasons:

5PrimePeptExtension     5 prime end of peptide extended beyond true start
Alu                     Alu repeats (5' AG/CT 3'), subclass of SINEs
artifact                HAVANA reason: artifact
athRage                 UCSC reason: athRage
chimeric                NCBI reason: chimeric
Chimeric_cDNA           Predicted Chimeric cDNA
Chimeric_clone          Chimeric clone
Chimeric_protein        Predicted Chimeric protein
circular_reference      Circular reference between Ensembl and SwissProt e
contains_genomic        NCBI reason: contains genomic
Cytochrome              Cytochrome
documented_aberrant     NCBI reason: documented aberrant
Env                     Similar to Envelope protein
frameshift              NCBI reason: frameshift
from_genomic            From genomic
Gag                     Similar to Gag-protein
HiThru                  High-throughput, low quality
Hypothetical            Protein is hypothetical/putative according to desc
invitroNorm             UCSC reason: invitroNorm
L1                      LINE-1 retrotransposon
L1_transposable         LINE-1 transposable element
LINE                    Long interspersed nuclear element (repeat)
Long                    Sequence is long and therefore difficult to align
Long_intron             Long intron
LTR                     Long terminal repeat
Memory                  Requires too much memory
mitochondrial           NCBI reason: mitochondrial
nedo                    NEDO
None                    Deprecated: no reason given
not_a_gene              NCBI reason: not a gene
N_string                Sequence contains a string of N
orestes                 UCSC reason: orestes
ORF                     Open reading frame (ORF)
organelle               NCBI reason: organelle
Other                   Deprecated: unclassified reason
P40                     P40
Partial                 Fragment/partial
Pol                     Similar to Pol protein
poor_quality_sequence   NCBI reason: poor quality sequence
Pro                     Similar to Pro protein
probably_wrong_genome   NCBI reason: probably wrong genome
Promiscuous             Aligns to the genome too many times
read-through            NCBI reason: read-through
Read-through_cDNA       Read-through transcript
Read-through_protein    Protein translated from read-through transcript
Repetitive              Repetitive/low complexity
restored                NCBI reason: restored
Retained_intron         Retained intron
retrovirus-like         NCBI reason: retrovirus-like
Riken                   Riken
RT                      Reverse transcriptase
Short                   Sequence is short and therefore difficult to align
SINE                    Short interspersed nuclear element (repeat)
synthetic_construct     Synthetic construct
Testis                  Sequenced from testis
Transposable_elements   Supports transposable elements
Transposase             Transposase
Ty5                     Transposon Ty5
Un-genewiseable         Un-genewiseable - predominantly ACTG
unknown                 NCBI reason: unknown
Viral                   Sequence is from a virus, unsuitable for vertebrat
Wrong_ORF               Wrong open reading frame
wrong_ORF               NCBI reason: wrong ORF
wrong_splice            Splice-site issues
wrong_strand            NCBI reason: wrong strand
X                       Sequence contains too many X

The 7 columns in the Ensembl_AccessionCategory.txt file are:

1. taxon_id
   Taxonomy ID for the accession that we have killed.

2. version
   Accession, with version, for the accession that we have killed.

3. mol_type
   Molecule type: cDNA or protein.

4. reasons
   Internal reasons given for why the accession was killed. Multiple
   reasons are allowed and are separated by commas.

5. analyses_allowed
   For many accessions this field is not populated and simply holds the
   value '.'. Any analysis listed in this column is one for which the
   accession may still be used in our pipeline. For example, there are
   4,886 accessions with the value 'cDNA_update' in this column. These
   accessions are killed for all Ensembl analyses except the cDNA_update
   analysis. In this particular case it means that the cDNAs will never
   be used to make transcript models or for UTR addition, but they will
   be aligned to the genome and displayed in our cDNA track.

6. species_allowed
   This option is rarely used, because an accession that is killed is
   usually killed for all species. Like analyses_allowed, this option
   allows us to kill an accession for all species except those listed
   here (by taxonomy ID). E.g. the mouse sequence Q8BT34.1 can be used as
   supporting evidence in a mouse build but will not be used as
   supporting evidence for any other species.

7. timestamp
   Date (YYYY-MM-DD) on which the accession was last added/modified.
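
The column layout above can be read with a few lines of Python. The following is only a minimal sketch, not part of the Ensembl pipeline. It assumes that the file is tab-delimited, that '.' marks an empty field in both analyses_allowed and species_allowed, and that multiple values in those two fields are comma-separated in the same way as reasons; none of these points is stated in this README. The names KillRecord, parse_kill_list and is_killed_for are hypothetical, and the two sample rows are made up purely for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Iterator


def _split(field: str) -> FrozenSet[str]:
    """Split a comma-separated field; '.' or an empty string means no values."""
    field = field.strip()
    if field in ("", "."):
        return frozenset()
    return frozenset(part.strip() for part in field.split(",") if part.strip())


@dataclass(frozen=True)
class KillRecord:
    """One row of the file, using the seven column names from the README."""
    taxon_id: str
    version: str
    mol_type: str
    reasons: FrozenSet[str]
    analyses_allowed: FrozenSet[str]
    species_allowed: FrozenSet[str]
    timestamp: str

    def is_killed_for(self, analysis: str, taxon_id: str) -> bool:
        """Return True if the accession should be ignored in this context.

        Assumption: a listed analysis or a listed species each exempts the
        accession on its own. The README does not say how the two columns
        combine when both are populated.
        """
        if analysis in self.analyses_allowed:
            return False
        if taxon_id in self.species_allowed:
            return False
        return True


def parse_kill_list(lines: Iterable[str]) -> Iterator[KillRecord]:
    """Yield a KillRecord per 7-column line (tab-delimited by assumption)."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 7:
            continue  # skip blank lines, headers, or malformed rows
        taxon_id, version, mol_type, reasons, analyses, species, stamp = cols
        yield KillRecord(
            taxon_id=taxon_id.strip(),
            version=version.strip(),
            mol_type=mol_type.strip(),
            reasons=_split(reasons),
            analyses_allowed=_split(analyses),
            species_allowed=_split(species),
            timestamp=stamp.strip(),
        )


# Made-up rows for illustration only; they are not taken from the real file.
sample = [
    "9606\tXX000001.1\tcDNA\tRepetitive,Short\tcDNA_update\t.\t2013-02-24",
    "10090\tXX000002.1\tprotein\tPartial\t.\t10090\t2013-02-24",
]
records = list(parse_kill_list(sample))
```

With the real file, the same function could be fed an open file handle, for example parse_kill_list(open("Ensembl_AccessionCategory.txt")), provided the delimiter assumption holds.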