NGS-SNP - annotate_INDELs.pl

Introduction

Genetic variants are typically described in terms of their difference from a reference genome sequence. In the case of many INDELs it is impossible to know exactly where the original mutation occurred relative to the reference, and the position of the insertion or deletion may be assigned arbitrarily. For example, an INDEL that yields the alternative sequence GATTC for a reference sequence GATTTC can be described in three ways: as a deletion of the T at position 3, at position 4, or position 5. The distinction isn't important from a biological standpoint since all three mutations yield the same alternative sequence, which the cellular machinery must interpret. However, existing INDEL annotation tools do not take this issue into account when comparing new INDELs to existing ones in a database. Furthermore, existing INDEL annotation tools accept the assigned position as the only position, and can fail to report consequences that are important possibilities by failing to examine the overlap between the equivalent representations of the INDEL and features such as start codons, splice sites, and UTRs.

INDEL annotation scripts

INDEL annotation

  1. Set up NGS-SNP. Note that the simplest approach is to follow the "Linux virtual machine" section of the installation guide.

  2. Obtain a list of INDELs from SAMtools, or some other INDEL calling software.

  3. Annotate the INDEL list using the annotate_INDELs.pl script.

The following commands illustrate a typical INDEL annotation session in which INDELs are annotated and then processed:

cd NGS-SNP/scripts/annotate_INDELs
perl annotate_INDELs.pl -s bos_taurus  -i indel.vcf -f indels.vcf.flanking  -o indels.vcf.annotated  -p indels.vcf.progress  -split indels.vcf.split
cd utilities
perl obtain_INDEL_equivalents_stats.pl -i indels.vcf.annotated -o consequence_changes.txt -s stats.txt
perl obtain_INDEL_characteristics.pl -i indels.vcf.annotated -o figures
    

Description

This script annotates short INDELs identified the sequencing. It is designed to accept INDELs from any other program that can generate a valid VCF file. More information on input format is given below. This script differs from many others in that it correctly accounts for the fact that a single INDEL can have several equivalent representations. When comparing a new INDEL to known INDELs the script first coverts the known INDEL into all equivalent representations so that matches, if present, can be detected. Similarly, for annotation purposes all equivalent representations are considered and evaluated.

Note

If you find that the annotation process sometimes fails due to lost connections to the Ensembl database or other required resources try using the script 'run_annotate_INDELs.pl' in place of this script. The same command-line parameters are used for both scripts. When the 'run_annotate_INDELs.pl' script is used the annotation process can restart automatically if problems are encountered, starting at the first unannotated INDEL.

By default this script obtains sequences and annotations from Ensembl, UniProt, and NCBI. To speed up the annotation process you may want to set up the required Ensembl databases locally. For the sample data in the test directory the required Ensembl databases are bos_taurus_core, bos_taurus_variation, bos_taurus_funcgen, homo_sapiens_core, homo_sapiens_variation, ensembl_compara, and ensembl_ontology. For more information on installing Ensembl databases see the NGS-SNP installation instructions.

Setup

See setup instructions for NGS-SNP.

Input

The INDELs provided as input to this script must have been obtained from a species contained in the Ensembl database. Only the standard VCF format is accepted.

VCF

The following is an example of the "VCF" format (see VCF (Variant Call Format) version 4.0). For annotation purposes NGS-SNP uses the "CHROM", "POS", "REF", and "ALT" columns. The "QUAL", "FILTER", and "INFO" columns are not used but are included in the annotated output. And all the genotype information are included in the annotated output. The first five columns must appear in the order depicted below. Note that only the rows describing INDELs are processed by NGS-SNP. The header should include a line starting with "##fileformat=VCF". Use '-format vcf' to process files in this format if "##fileformat=VCF" is missing in your input :

##fileformat=VCFv4.0
#CHROM   POS     ID        REF ALT    QUAL FILTER INFO                             
Chr11	3511039	.	G	GA	60.6	.	INDEL;DP=35;VDB=0.0445;AF1=1;AC1=2;DP4=0,0,1,5;MQ=38;FQ=-52.5
Chr11	3511061	.	T	TC	183	.	INDEL;DP=30;VDB=0.0435;AF1=1;AC1=2;DP4=0,0,4,5;MQ=42;FQ=-61.5
Chr11	44710813	.	T	TG	105	.	INDEL;DP=18;VDB=0.0449;AF1=0.5;AC1=1;DP4=7,4,6,1;MQ=49;FQ=108;PV4=0.6,1,1,1

    

Output fields in annotated INDEL files

All of the fields present in the input file are reproduced in the annotated INDEL list. The first field added by NGS-SNP is called "Functional_Class"--this and the other fields are described below:

Functional_Class
the type of INDEL ('inframe_deletion' for example). They are a subset of the variation classes used by Ensembl, and are ordered below according to the level of severity assigned by Ensembl (and annotate_INDELs.pl), from most severe to least severe (note that the terms changed in Ensembl 68):
New Term
(Ensembl 68 and newer)
Old Term
(pre-Ensembl 68)
Description
splice_donor_variant Essential splice site A splice variant that changes the 2 base region at the 5' end of an intron.
splice_acceptor_variant Essential splice site A splice variant that changes the 2 base region at the 3' end of an intron.
stop_gained Stop gained A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript.
stop_lost Stop lost A sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript.
frameshift_variant Frameshift coding A sequence variant which causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three.
initiator_codon_variant Non synonymous coding A codon variant that changes at least one base of the first codon of a transcript.
inframe_insertion Non synonymous coding An inframe non synonymous variant that inserts bases into in the coding sequence.
inframe_deletion Non synonymous coding An inframe non synonymous variant that deletes bases from the coding sequence.
missense_variant Non synonymous coding A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved.
splice_region_variant Splice site A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron.
incomplete_terminal_codon_variant Partial codon A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed.
coding_sequence_variant Coding unknown A sequence variant that changes the coding sequence.
mature_miRNA_variant Within mature miRNA A transcript variant located with the sequence of the mature miRNA.
5_prime_UTR_variant 5prime UTR A UTR variant of the 5' UTR.
3_prime_UTR_variant 3prime UTR A UTR variant of the 3' UTR.
intron_variant Intronic A transcript variant occurring within an intron.
non_coding_exon_variant Within non coding gene A sequence variant that changes non-coding exon sequence.
nc_transcript_variant Within non coding gene A transcript variant of a non coding RNA.
NMD_transcript_variant NMD transcript A variant in a transcript that is the target of NMD.
upstream_gene_variant Upstream A sequence variant located 5' of a gene.
downstream_gene_variant Downstream A sequence variant located 3' of a gene.
TF_binding_site_variant Regulatory region A sequence variant located within a transcription factor binding site.
regulatory_region_variant Regulatory region A sequence variant located within a regulatory region.
feature_elongation Feature elongation A sequence variant that causes the extension of a genomic feature, with regard to the reference sequence.
feature_truncation Feature truncation A sequence variant that causes the reduction of a genomic feature, with regard to the reference sequence.
intergenic_variant Intergenic A sequence variant located in the intergenic region, between genes.
Chromosome_Reference
the nucleotide found in the reference genome.
Chromosome_Reads
the nucleotides observed by resequencing.
Gene_Description
a short description of the relevant gene.
Ensembl_Gene_ID
the Ensembl Gene ID of the relevant gene.
Entrez_Gene_Name
the Entrez Gene name of the relevant gene.
Entrez_Gene_ID
the Entrez Gene ID of the relevant gene.
Ensembl_Transcript_ID
the Ensembl Transcript ID of the affected transcript.
Transcript_INDEL_Position
the position of the INDEL, on the transcript.
Transcript_INDEL_Reference
the reference genome allele, as it would appear in the mature transcript.
Transcript_INDEL_Reads
a nucleotide found by resequencing, as it would appear in the mature transcript.
Transcript_To_Chromosome_Strand
the strand of the transcript compared to the chromosome. 'forward' indicates that the alleles given in 'Transcript_INDEL_Reference' and 'Transcript_INDEL_Reads' refer to the forward strand of the reference chromosome sequence. 'reverse' indicates that the alleles refer to the reverse strand of the reference chromosome sequence.
Ensembl_Protein_ID
the Ensembl Protein ID of the relevant protein.
UniProt_ID
the UniProt ID of the relevant protein.
Amino_Acid_Position
the position of the relevant amino acid in the protein sequence.
Overlapping_Protein_Domains
protein domains that overlap with the position of the affected amino acid.
Overlapping_Protein_Features
protein features, obtained from UniProt, that overlap with the position of the relevant amino acid. This field is not calculated if the -no_uniprot option is specified.
Gene_Ontology
GO slim IDs and terms associated with the relevant transcript.
Model_Annotations
provides information transferred from the model species (specified using the -model option) orthologue of the relevant gene, or from the relevant gene itself if the input INDELs are from the model species. Information in this field is given in the form of key-value pairs.
Phenotypes_Gene phenotypic information associated with the model species gene, obtained from NCBI's Gene database. The value for this key is not calculated if the -no_ncbi option is specified.
Interactions_Count the number of reported interactions involving proteins produced by the model species gene, obtained from NCBI's Gene database. The value for this key is not calculated if the -no_ncbi option is specified.
KEGG KEGG pathways associated with the model species gene, obtained from NCBI's Gene database. The value for this key is not calculated if the -no_ncbi option is specified.
Overlapping_Protein_Features protein features from the model species gene, obtained from UniProt, that overlap with the position of the relevant amino acid. The value for this key is not calculated if the -no_uniprot option is specified.
Comments
provides additional information about the INDEL in the form of key-value pairs. The contents of this field are likely to change as NGS-SNP is updated with new annotation capabilities. By placing new annotations in this field the output format of NGS-SNP can remain relatively stable as the program is updated.
Genotype genotype information parsed from the input VCF file.
Gene_Biotype a gene classification to indicate the biological significance of the gene (e.g. whether it is protein coding or a pseudogene).
Transcript_Biotype a transcript classification to indicate the biological significance of the transcript (e.g. whether it is protein coding or a pseudogene).
Gene_Status a "Known" gene has at least one transcript with a sequence match in a sequence repository external to Ensembl for the same species. "Known by Projection" refers to genes that are homologous, based on Ensembl comparative analysis, to genes with "Known" status in another species (usually human genes). A "Novel" gene contains only transcripts that have a sequence match outside Ensembl for an alternate species. "Merged" indicates that the gene has at least one merged (gold) transcript. A merged transcript is a case where an identical sequence to the Ensembl prediction has been determined by the VEGA/Havana project. Two distinctions are made. If the coding sequence (CDS) is the same, but the UTR differs, the link to the Havana ID (OTT...) will read: "Havana transcript having same CDS". If the entire transcript (both UTR and CDS) is the same in Havana and Ensembl, the link to the Havana transcript will read: "Transcript having exact match between Ensembl and Havana". Merged genes are only available for human, mouse, and zebrafish.
Transcript_Status a "Known" transcript has a sequence match in a sequence repository external to Ensembl for the same species. A "Novel" transcript has a sequence match outside Ensembl for an alternate species.
Length_Downstream_Protein the distance in amino acids from the INDEL to the end of the affected protein. A ratio further describing the location of the INDEL relative to the protein is given in parentheses. Distances are reported when the INDEL is in the coding regions.
Distance_From_Transcript the distance in bases from the INDEL to the transcript. A qualifier further describing the location of the INDEL relative to the transcript is given in parentheses. Distances are reported when the INDEL is upstream or downstream of the transcript.
SIFT_Prediction_Ensembl a prediction of whether or not the amino acid change is deleterious. This key may be given for INDELs that are classified as a missense INDEL because they cause an amino acid substitution in addition to inserting/deleting one or more amino acids. The format is 'prediction(score)'. See the SIFT web site for more information. This value is obtained from Ensembl and may not be available for all species.
Other_Consequences additional functional consequences for this variant. Multiple consequences can be predicted, for example, when sequence features overlap, or because multiple consequence categories apply. Values are given in the form 'function_class(gene_id,transcript_id,protein_id)' where the gene, transcript, and protein IDs are those obtained from Ensembl and indicate which features may be affected by the consequence.
Number_Of_Equivalent_Indels the number of equivalent INDELs that can exist for this INDEL, given in the form 'number of equivalent INDELs(number of upstream equivalent INDELs on the forward strand of the reference,number of downstream equivalent INDELs on the forward strand of the reference)'. For example, the value '28(28,0)' indicates that there are 28 equivalent INDELs for the input INDEL (not including the input INDEL itself) and that all the equivalent INDELs have smaller positions on the forward strand of the reference sequence. Equivalent INDELs are those that lead to the formation of identical sequences, but which differ in terms of how they are described. Consider the reference sequence ATTGC. Deletion of the first or the second T gives the same sequence, ATGC. However, the two deletions differ with respect to how they are described. The first would be described as a deletion at position 2 and could be denoted as A[T/-]TGC whereas the second would be described as a deletion at position 3 and could be denoted as AT[T/-]GC. Although it is possible that the two INDELs arose from independent events affecting different sites in the reference sequence, the two INDELs are now equivalent and indistinguishable.
Consequences_For_Equivalent_Indels functional consequences for the equivalent representations of this INDEL. Values are given in the form 'function_class(position,reference allele,alternative allele,gene_id,transcript_id,protein_id)' where position, reference allele, and alternative allele describe the equivalent INDEL to which the consequence applies, and the gene, transcript, and protein IDs are those obtained from Ensembl and indicate which features may be affected by the consequence. Although equivalent INDELs are biologically indistinguishable they can differ with respect to their positions relative to annotated features on the reference sequence, and thus can yield different functional consequence predictions. The true functional consequence of an INDEL will depend on how the cellular machinery interprets the alternative sequence. The -eq option specifies the maximum number of consequences to report in this section. The consequences in this section are sorted such that those deemed to be most severe are reported first.
Ref_INDELs
refSNP (rs) IDs of known INDELs that are equivalent to the input INDEL. If the input INDEL matches a known INDEL then the rsID is given. If an equivalent INDEL to the input INDEL matches a known INDEL then the rsID of the known INDEL is given along with the details of the equivalent INDEL, in the format 'rsID(equivalent,position,reference allele,alternative allele)' where the word 'equivalent' indicates that the match was obtained for an equivalent INDEL, and the position, reference allele, and alternative allele describe the particular equivalent for which the match was obtained. For example, the value 'rs137977926;rs200444439(equivalent,20853505,-,GTTTT)' indicates that the input INDEL matches record rs137977926 and that an equivalent INDEL to the input INDEL at position 20853505 matches record rs200444439.
Is_Fully_Known
whether or not the alleles in the input INDEL are fully described by known refSNPs.

Output columns in flanking sequence files

The flanking sequence file consists of two columns:

INDEL_ID
a name assigned to the INDEL by NGS-SNP. The name consists of four values separated by underscores: chromosome, position, base on the forward strand in the reference sequence.
Sequence
the flanking sequence of the INDEL. The INDEL alleles are given in square brackets, in the middle of the sequence.

Directories and files produced when a directory name is supplied using the -split option

Annotated INDELs and flanking sequences grouped by functional class
a directory called "All" is created inside of the -split directory. Annotated INDELs and flanking sequences grouped by INDEL functional class are added to the "All" directory. For example, all 'stop_gained' INDELs can be found in the "All" directory, in one or more files with names beginning with "All_stop_gained" (multiple files for the same functional class are created when the number of INDELs in the functional class exceeds 100,000--each file is given a distinct numerical suffix). For each annotated INDEL file there is a matching flanking sequence file.
Annotated INDELs and flanking sequences grouped by functional class and chromosome
directories named after chromosomes are created inside of the -split directory. Within these chromosome directories are INDELs and flanking sequences organized by chromosome and INDEL functional class. For example, all inframe_insertion INDELs on chromosome 1 can be found in the "1" directory, in one or more files with names beginning with "1_inframe_insertion".
INDEL summary files
within the -split directory two files are created, called "chromosome_function_stats.txt" and "function_stats.txt". These files give the number of INDELs assigned to each functional class, and the number of known and novel INDELs. In the "chromosome_function_stats.txt" file the counts are given separately for each chromosome, whereas in the "function_stats.txt" file the entire dataset is summarized. Note that if the -all option is used the interpretation of these files differs, since more than one output INDEL can be generated for each input INDEL. For example, a single input INDEL might generate a stop_gained INDEL and an intron_variant INDEL when the -all option is used. Both of these INDEL consequences will be counted in the summary files. When the -all option is not used, only one consequence is reported for each input INDEL.

Tips

Usage

USAGE:
perl annotate_INDELs.pl [-arguments]

-i [STRING]        : file containing INDELs to annotate. The script will 
                     attempt to determine the input file format automatically, 
                     unless the -format option is used to indicate format 
                     (Required).
-o [FILE]          : the output file to create (Required).
-s [STRING]        : the source species in the form genus_species (e.g. 
                     'homo_sapiens', 'bos_taurus', 'sus_scrofa'). To view the
                     complete list of available species use the -show_species
                     option (Required).
-f [FILE]          : write genomic flanking sequence for each INDEL to this 
                     file (Optional).
-n [STRINGS]       : chromosome name changes to make when querying Ensembl. 
                     For example, 'chr30=chrX' causes all instances of 
                     'chr30' in the input file to be treated as 'chrX' 
                     for Ensembl searching. To view the chromosome names
                     recognized by Ensembl use the -show_chr option 
                     (Optional).
-d [INTEGER]       : amount of flanking genomic sequence to write on each 
                     side of INDELs when -f option is used (Optional; default 
                     is 100 bases).
-c                 : begin by counting the number of records already present 
                     in the output file and skip that number of input 
                     records. This option can be useful if the script was 
                     disrupted during a previous run. Do not use if the -all 
                     option was used when creating the existing output file.
                     Use of this option is no longer recommended--use the -p
                     option instead (Optional).
-p [FILE]          : file for tracking annotation progress. If the file does 
                     not exist, the script will create the file and use it to
                     keep track of which records have been annotated as
                     annotation progresses. If the file already exists, the
                     script will parse it to determine where the annotation 
                     process last ended, so that it can continue. Use this
                     option so that the annotation can continue with the first
                     unannotated INDEL in the event of a disrupted network
                     connection or error (Optional).
-L [FILE]          : log file to contain additional INDEL information and 
                     processing messages (Optional).
-v                 : provide verbose output (Optional).
-sk [FILE]         : file to contain list of input INDELs that were skipped
                     because they could not be processed (Optional). 
-no_multi          : skip multi-allelic INDELs (Optional; default is to 
                     annotate the first alternative allele).
-split [DIRECTORY] : split up the results based on chromosome and functional 
                     class and write to this directory (Optional).
-format [STRING]   : format of input file ('vcf') (Optional; default is to 
                     determine format from the input).
-model [STRING]    : the model species to use when filling in the 
                     'Model_Annotations' column in the output. The model
                     species can be the same as the source species. See the -s
                     option for information on species name format and on 
                     available species. (Optional; default value is 
                     'homo_sapiens').
-ncbi [FILE]       : the location of the ncbi_search.pl script (Optional; 
                     default is to locate automatically).
-uniprot [FILE]    : the location of the ebi_fetch.pl script (Optional; 
                     default is to locate automatically).
-no_ncbi           : do not perform annotation steps that involve retrieving
                     data from NCBI. Using this option will slightly speed up
                     the annotation process but will decrease the amount of
                     information provided in the 'Model_Annotations' column.
                     Do not run multiple instances of this script when not 
                     using this option (Optional; default is to perform steps 
                     that require data from NCBI).
-no_uniprot        : do not perform annotation steps that involve retrieving
                     data from UniProt. Using this option will slightly speed 
                     up the annotation process but will decrease the amount of 
                     information provided in the 
                     'Overlapping_Protein_Features' and 'Model_Annotations' 
                     columns. Do not run multiple instances of this script
                     when not using this option (Optional; default is to 
                     perform steps that require data from UniProt).
-all               : return all consequences for each INDEL. A single input 
                     INDEL can have multiple consequences for a variety of 
                     reasons (an INDEL may be located in the exon of one gene 
                     and the intron of another). The -all option causes each 
                     consequence to be described by a separate row in the 
                     output file. (Optional; default is to try to return the 
                     most severe consequence for each INDEL).
-status [STRING]   : the types of genes and transcripts to be used during 
                     annotation. Values "known" or "novel" can be specified. 
                     (Optional; default is to use all genes and transcripts).
-assembly [STRING] : the name of the assembly that was used as the reference
                     for INDEL discovery, for example, 'Btau_4.0' or 'NCBI34'.
                     If the name supplied using this option does not match
                     the name of the default assembly available in Ensembl 
                     an attempt is made to convert the INDEL locations and
                     alleles to the Ensembl assembly. To view the assembly
                     names recognized, use the -show_assem option (Optional; 
                     default is to assume the default Ensembl assembly was 
                     used for INDEL discovery).
-mismatch [INTEGER]: program behaviour when a reference allele in the input 
                     file disagrees with the genome assembly in Ensembl. 1 = 
                     exit with error; 2 = use the reference allele from the 
                     Ensembl assembly. Such mismatches can arise when the 
                     -assembly option is used to convert the INDEL locations 
                     and alleles to the Ensembl assembly (Optional; default 
                     is 1).
-host [STRING]     : the Ensembl database host (Optional; default is 
                     'ensembldb.ensembl.org').
-user [STRING]     : the Ensembl database user (Optional; default is 
                     'anonymous').
-pass [STRING]     : the Ensembl database password (Optional; default is '').
-port [INTEGER]    : the TCP port to use when connecting to the Ensembl 
                     database (Optional; default is to let the script choose 
                     the port).
-ann_db [DIRECTORY]: create or use flat file databases in this directory for
                     storing the annotated INDELs and flanking sequences 
                     generated by this script, so that future INDELs located at
                     the same position and involving the same alleles can be
                     annotated more quickly. The same directory can always be 
                     specified regardless of the species and other options 
                     being used--previously calculated annotations and 
                     flanking sequences are only applied to a new INDEL when 
                     appropriate (i.e. same species, alleles, etc.). This 
                     option is particularly useful when numerous INDELs from 
                     different individuals of the same species are to be
                     annotated by repeated use of this script.
-help              : provide list of arguments and exit (Optional).
-show_species      : show the species values that can be used with the -s, 
                     -cs, and -model options and then exit. The -host option
                     should also be given, unless the default host is 
                     to be used for annotation (Optional).
-show_assem        : show the assembly values that can be used with the 
                     -assembly option and then exit. The -s option is 
                     required with this option, to specify the species 
                     (Optional).
-show_chr          : show the chromosome names recognized by Ensembl and then
                     exit. This information is useful when using the -n
                     option. The -s option is required with this option, to
                     specify the species (Optional).

perl annotate_INDELs.pl -i indels.tab -o results.tab -s bos_taurus  -p progress