Genetic variants are typically described in terms of their difference from a reference genome sequence. In the case of many INDELs it is impossible to know exactly where the original mutation occurred relative to the reference, and the position of the insertion or deletion may be assigned arbitrarily. For example, an INDEL that yields the alternative sequence GATTC for a reference sequence GATTTC can be described in three ways: as a deletion of the T at position 3, at position 4, or position 5. The distinction isn't important from a biological standpoint since all three mutations yield the same alternative sequence, which the cellular machinery must interpret. However, existing INDEL annotation tools do not take this issue into account when comparing new INDELs to existing ones in a database. Furthermore, existing INDEL annotation tools accept the assigned position as the only position, and can fail to report consequences that are important possibilities by failing to examine the overlap between the equivalent representations of the INDEL and features such as start codons, splice sites, and UTRs.
Set up NGS-SNP. Note that the simplest approach is to follow the "Linux virtual machine" section of the installation guide.
Obtain a list of INDELs from SAMtools, or some other INDEL calling software.
Annotate the INDEL list using the annotate_INDELs.pl script.
The following commands illustrate a typical INDEL annotation session in which INDELs are annotated and then processed:
cd NGS-SNP/scripts/annotate_INDELs perl annotate_INDELs.pl -s bos_taurus -i indel.vcf -f indels.vcf.flanking -o indels.vcf.annotated -p indels.vcf.progress -split indels.vcf.split cd utilities perl obtain_INDEL_equivalents_stats.pl -i indels.vcf.annotated -o consequence_changes.txt -s stats.txt perl obtain_INDEL_characteristics.pl -i indels.vcf.annotated -o figures
This script annotates short INDELs identified the sequencing. It is designed to accept INDELs from any other program that can generate a valid VCF file. More information on input format is given below. This script differs from many others in that it correctly accounts for the fact that a single INDEL can have several equivalent representations. When comparing a new INDEL to known INDELs the script first coverts the known INDEL into all equivalent representations so that matches, if present, can be detected. Similarly, for annotation purposes all equivalent representations are considered and evaluated.
If you find that the annotation process sometimes fails due to lost connections to the Ensembl database or other required resources try using the script 'run_annotate_INDELs.pl' in place of this script. The same command-line parameters are used for both scripts. When the 'run_annotate_INDELs.pl' script is used the annotation process can restart automatically if problems are encountered, starting at the first unannotated INDEL.
By default this script obtains sequences and annotations from Ensembl, UniProt, and NCBI. To speed up the annotation process you may want to set up the required Ensembl databases locally. For the sample data in the test directory the required Ensembl databases are bos_taurus_core, bos_taurus_variation, bos_taurus_funcgen, homo_sapiens_core, homo_sapiens_variation, ensembl_compara, and ensembl_ontology. For more information on installing Ensembl databases see the NGS-SNP installation instructions.
See setup instructions for NGS-SNP.
The INDELs provided as input to this script must have been obtained from a species contained in the Ensembl database. Only the standard VCF format is accepted.
The following is an example of the "VCF" format (see VCF (Variant Call Format) version 4.0). For annotation purposes NGS-SNP uses the "CHROM", "POS", "REF", and "ALT" columns. The "QUAL", "FILTER", and "INFO" columns are not used but are included in the annotated output. And all the genotype information are included in the annotated output. The first five columns must appear in the order depicted below. Note that only the rows describing INDELs are processed by NGS-SNP. The header should include a line starting with "##fileformat=VCF". Use '-format vcf' to process files in this format if "##fileformat=VCF" is missing in your input :
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO Chr11 3511039 . G GA 60.6 . INDEL;DP=35;VDB=0.0445;AF1=1;AC1=2;DP4=0,0,1,5;MQ=38;FQ=-52.5 Chr11 3511061 . T TC 183 . INDEL;DP=30;VDB=0.0435;AF1=1;AC1=2;DP4=0,0,4,5;MQ=42;FQ=-61.5 Chr11 44710813 . T TG 105 . INDEL;DP=18;VDB=0.0449;AF1=0.5;AC1=1;DP4=7,4,6,1;MQ=49;FQ=108;PV4=0.6,1,1,1
All of the fields present in the input file are reproduced in the annotated INDEL list. The first field added by NGS-SNP is called "Functional_Class"--this and the other fields are described below:
(Ensembl 68 and newer)
|splice_donor_variant||Essential splice site||A splice variant that changes the 2 base region at the 5' end of an intron.|
|splice_acceptor_variant||Essential splice site||A splice variant that changes the 2 base region at the 3' end of an intron.|
|stop_gained||Stop gained||A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript.|
|stop_lost||Stop lost||A sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript.|
|frameshift_variant||Frameshift coding||A sequence variant which causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three.|
|initiator_codon_variant||Non synonymous coding||A codon variant that changes at least one base of the first codon of a transcript.|
|inframe_insertion||Non synonymous coding||An inframe non synonymous variant that inserts bases into in the coding sequence.|
|inframe_deletion||Non synonymous coding||An inframe non synonymous variant that deletes bases from the coding sequence.|
|missense_variant||Non synonymous coding||A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved.|
|splice_region_variant||Splice site||A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron.|
|incomplete_terminal_codon_variant||Partial codon||A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed.|
|coding_sequence_variant||Coding unknown||A sequence variant that changes the coding sequence.|
|mature_miRNA_variant||Within mature miRNA||A transcript variant located with the sequence of the mature miRNA.|
|5_prime_UTR_variant||5prime UTR||A UTR variant of the 5' UTR.|
|3_prime_UTR_variant||3prime UTR||A UTR variant of the 3' UTR.|
|intron_variant||Intronic||A transcript variant occurring within an intron.|
|non_coding_exon_variant||Within non coding gene||A sequence variant that changes non-coding exon sequence.|
|nc_transcript_variant||Within non coding gene||A transcript variant of a non coding RNA.|
|NMD_transcript_variant||NMD transcript||A variant in a transcript that is the target of NMD.|
|upstream_gene_variant||Upstream||A sequence variant located 5' of a gene.|
|downstream_gene_variant||Downstream||A sequence variant located 3' of a gene.|
|TF_binding_site_variant||Regulatory region||A sequence variant located within a transcription factor binding site.|
|regulatory_region_variant||Regulatory region||A sequence variant located within a regulatory region.|
|feature_elongation||Feature elongation||A sequence variant that causes the extension of a genomic feature, with regard to the reference sequence.|
|feature_truncation||Feature truncation||A sequence variant that causes the reduction of a genomic feature, with regard to the reference sequence.|
|intergenic_variant||Intergenic||A sequence variant located in the intergenic region, between genes.|
|Phenotypes_Gene||phenotypic information associated with the model species gene, obtained from NCBI's Gene database. The value for this key is not calculated if the -no_ncbi option is specified.|
|Interactions_Count||the number of reported interactions involving proteins produced by the model species gene, obtained from NCBI's Gene database. The value for this key is not calculated if the -no_ncbi option is specified.|
|KEGG||KEGG pathways associated with the model species gene, obtained from NCBI's Gene database. The value for this key is not calculated if the -no_ncbi option is specified.|
|Overlapping_Protein_Features||protein features from the model species gene, obtained from UniProt, that overlap with the position of the relevant amino acid. The value for this key is not calculated if the -no_uniprot option is specified.|
|Genotype||genotype information parsed from the input VCF file.|
|Gene_Biotype||a gene classification to indicate the biological significance of the gene (e.g. whether it is protein coding or a pseudogene).|
|Transcript_Biotype||a transcript classification to indicate the biological significance of the transcript (e.g. whether it is protein coding or a pseudogene).|
|Gene_Status||a "Known" gene has at least one transcript with a sequence match in a sequence repository external to Ensembl for the same species. "Known by Projection" refers to genes that are homologous, based on Ensembl comparative analysis, to genes with "Known" status in another species (usually human genes). A "Novel" gene contains only transcripts that have a sequence match outside Ensembl for an alternate species. "Merged" indicates that the gene has at least one merged (gold) transcript. A merged transcript is a case where an identical sequence to the Ensembl prediction has been determined by the VEGA/Havana project. Two distinctions are made. If the coding sequence (CDS) is the same, but the UTR differs, the link to the Havana ID (OTT...) will read: "Havana transcript having same CDS". If the entire transcript (both UTR and CDS) is the same in Havana and Ensembl, the link to the Havana transcript will read: "Transcript having exact match between Ensembl and Havana". Merged genes are only available for human, mouse, and zebrafish.|
|Transcript_Status||a "Known" transcript has a sequence match in a sequence repository external to Ensembl for the same species. A "Novel" transcript has a sequence match outside Ensembl for an alternate species.|
|Length_Downstream_Protein||the distance in amino acids from the INDEL to the end of the affected protein. A ratio further describing the location of the INDEL relative to the protein is given in parentheses. Distances are reported when the INDEL is in the coding regions.|
|Distance_From_Transcript||the distance in bases from the INDEL to the transcript. A qualifier further describing the location of the INDEL relative to the transcript is given in parentheses. Distances are reported when the INDEL is upstream or downstream of the transcript.|
|SIFT_Prediction_Ensembl||a prediction of whether or not the amino acid change is deleterious. This key may be given for INDELs that are classified as a missense INDEL because they cause an amino acid substitution in addition to inserting/deleting one or more amino acids. The format is 'prediction(score)'. See the SIFT web site for more information. This value is obtained from Ensembl and may not be available for all species.|
|Other_Consequences||additional functional consequences for this variant. Multiple consequences can be predicted, for example, when sequence features overlap, or because multiple consequence categories apply. Values are given in the form 'function_class(gene_id,transcript_id,protein_id)' where the gene, transcript, and protein IDs are those obtained from Ensembl and indicate which features may be affected by the consequence.|
|Number_Of_Equivalent_Indels||the number of equivalent INDELs that can exist for this INDEL, given in the form 'number of equivalent INDELs(number of upstream equivalent INDELs on the forward strand of the reference,number of downstream equivalent INDELs on the forward strand of the reference)'. For example, the value '28(28,0)' indicates that there are 28 equivalent INDELs for the input INDEL (not including the input INDEL itself) and that all the equivalent INDELs have smaller positions on the forward strand of the reference sequence. Equivalent INDELs are those that lead to the formation of identical sequences, but which differ in terms of how they are described. Consider the reference sequence ATTGC. Deletion of the first or the second T gives the same sequence, ATGC. However, the two deletions differ with respect to how they are described. The first would be described as a deletion at position 2 and could be denoted as A[T/-]TGC whereas the second would be described as a deletion at position 3 and could be denoted as AT[T/-]GC. Although it is possible that the two INDELs arose from independent events affecting different sites in the reference sequence, the two INDELs are now equivalent and indistinguishable.|
|Consequences_For_Equivalent_Indels||functional consequences for the equivalent representations of this INDEL. Values are given in the form 'function_class(position,reference allele,alternative allele,gene_id,transcript_id,protein_id)' where position, reference allele, and alternative allele describe the equivalent INDEL to which the consequence applies, and the gene, transcript, and protein IDs are those obtained from Ensembl and indicate which features may be affected by the consequence. Although equivalent INDELs are biologically indistinguishable they can differ with respect to their positions relative to annotated features on the reference sequence, and thus can yield different functional consequence predictions. The true functional consequence of an INDEL will depend on how the cellular machinery interprets the alternative sequence. The -eq option specifies the maximum number of consequences to report in this section. The consequences in this section are sorted such that those deemed to be most severe are reported first.|
The flanking sequence file consists of two columns:
A single input INDEL can have multiple consequences for a variety of reasons. For example, one of the alleles may lead to a non-synonymous change in a coding sequence while another introduces a stop codon. Also, a single INDEL may lead to alterations in several transcripts arising from a single gene. By default, this script reports a single consequence for each INDEL. The reported consequence is selected based on the functional class. All the other consequences for the input INDELs are given in the comments column.
Use the -p option to keep track of annotation progress so that the annotation can be continued if a problem is encountered. To have the analysis relaunched automatically when errors are encountered, use the script 'run_annotate_INDELs.pl' in place of this script.
The -f option can be used to obtain the genomic flanking sequence and INDEL alleles for each INDEL. The INDEL alleles are shown between square brackets. The -df option causes known INDEL sites in the flanking sequence and at the INDEL position to be included in the output. The flanking sequence and alleles are given as they would appear on the forward strand of the chromosome containing the INDEL.
The -ann_db option is particularly useful when numerous INDELs from different individuals of the same species are to be annotated by repeated use of this script.
USAGE: perl annotate_INDELs.pl [-arguments] -i [STRING] : file containing INDELs to annotate. The script will attempt to determine the input file format automatically, unless the -format option is used to indicate format (Required). -o [FILE] : the output file to create (Required). -s [STRING] : the source species in the form genus_species (e.g. 'homo_sapiens', 'bos_taurus', 'sus_scrofa'). To view the complete list of available species use the -show_species option (Required). -f [FILE] : write genomic flanking sequence for each INDEL to this file (Optional). -n [STRINGS] : chromosome name changes to make when querying Ensembl. For example, 'chr30=chrX' causes all instances of 'chr30' in the input file to be treated as 'chrX' for Ensembl searching. To view the chromosome names recognized by Ensembl use the -show_chr option (Optional). -d [INTEGER] : amount of flanking genomic sequence to write on each side of INDELs when -f option is used (Optional; default is 100 bases). -c : begin by counting the number of records already present in the output file and skip that number of input records. This option can be useful if the script was disrupted during a previous run. Do not use if the -all option was used when creating the existing output file. Use of this option is no longer recommended--use the -p option instead (Optional). -p [FILE] : file for tracking annotation progress. If the file does not exist, the script will create the file and use it to keep track of which records have been annotated as annotation progresses. If the file already exists, the script will parse it to determine where the annotation process last ended, so that it can continue. Use this option so that the annotation can continue with the first unannotated INDEL in the event of a disrupted network connection or error (Optional). -L [FILE] : log file to contain additional INDEL information and processing messages (Optional). -v : provide verbose output (Optional). -sk [FILE] : file to contain list of input INDELs that were skipped because they could not be processed (Optional). -no_multi : skip multi-allelic INDELs (Optional; default is to annotate the first alternative allele). -split [DIRECTORY] : split up the results based on chromosome and functional class and write to this directory (Optional). -format [STRING] : format of input file ('vcf') (Optional; default is to determine format from the input). -model [STRING] : the model species to use when filling in the 'Model_Annotations' column in the output. The model species can be the same as the source species. See the -s option for information on species name format and on available species. (Optional; default value is 'homo_sapiens'). -ncbi [FILE] : the location of the ncbi_search.pl script (Optional; default is to locate automatically). -uniprot [FILE] : the location of the ebi_fetch.pl script (Optional; default is to locate automatically). -no_ncbi : do not perform annotation steps that involve retrieving data from NCBI. Using this option will slightly speed up the annotation process but will decrease the amount of information provided in the 'Model_Annotations' column. Do not run multiple instances of this script when not using this option (Optional; default is to perform steps that require data from NCBI). -no_uniprot : do not perform annotation steps that involve retrieving data from UniProt. Using this option will slightly speed up the annotation process but will decrease the amount of information provided in the 'Overlapping_Protein_Features' and 'Model_Annotations' columns. Do not run multiple instances of this script when not using this option (Optional; default is to perform steps that require data from UniProt). -all : return all consequences for each INDEL. A single input INDEL can have multiple consequences for a variety of reasons (an INDEL may be located in the exon of one gene and the intron of another). The -all option causes each consequence to be described by a separate row in the output file. (Optional; default is to try to return the most severe consequence for each INDEL). -status [STRING] : the types of genes and transcripts to be used during annotation. Values "known" or "novel" can be specified. (Optional; default is to use all genes and transcripts). -assembly [STRING] : the name of the assembly that was used as the reference for INDEL discovery, for example, 'Btau_4.0' or 'NCBI34'. If the name supplied using this option does not match the name of the default assembly available in Ensembl an attempt is made to convert the INDEL locations and alleles to the Ensembl assembly. To view the assembly names recognized, use the -show_assem option (Optional; default is to assume the default Ensembl assembly was used for INDEL discovery). -mismatch [INTEGER]: program behaviour when a reference allele in the input file disagrees with the genome assembly in Ensembl. 1 = exit with error; 2 = use the reference allele from the Ensembl assembly. Such mismatches can arise when the -assembly option is used to convert the INDEL locations and alleles to the Ensembl assembly (Optional; default is 1). -host [STRING] : the Ensembl database host (Optional; default is 'ensembldb.ensembl.org'). -user [STRING] : the Ensembl database user (Optional; default is 'anonymous'). -pass [STRING] : the Ensembl database password (Optional; default is ''). -port [INTEGER] : the TCP port to use when connecting to the Ensembl database (Optional; default is to let the script choose the port). -ann_db [DIRECTORY]: create or use flat file databases in this directory for storing the annotated INDELs and flanking sequences generated by this script, so that future INDELs located at the same position and involving the same alleles can be annotated more quickly. The same directory can always be specified regardless of the species and other options being used--previously calculated annotations and flanking sequences are only applied to a new INDEL when appropriate (i.e. same species, alleles, etc.). This option is particularly useful when numerous INDELs from different individuals of the same species are to be annotated by repeated use of this script. -help : provide list of arguments and exit (Optional). -show_species : show the species values that can be used with the -s, -cs, and -model options and then exit. The -host option should also be given, unless the default host is to be used for annotation (Optional). -show_assem : show the assembly values that can be used with the -assembly option and then exit. The -s option is required with this option, to specify the species (Optional). -show_chr : show the chromosome names recognized by Ensembl and then exit. This information is useful when using the -n option. The -s option is required with this option, to specify the species (Optional). perl annotate_INDELs.pl -i indels.tab -o results.tab -s bos_taurus -p progress